INTRODUCTION


This SPIE forum marked the fifth meeting on the subject, and the continued interest
and popularity shown by the attendance of the real time signal processing community
indicates that, for the foreseeable future, the subject will continue to be one
of those covered by the SPIE technical symposia. Topics range from concepts and
applications through subsystems to systems. In these proceedings, system users
describe their needs, constraints, and goals, while processing technologists discuss
the applications of algorithms and computational architectures. Device
developers disclose what they have achieved in hardware.

How  do  we  explain  the  continued  popularity  of  these  sessions  over  the  years? 
Don't  other  workshops,  symposia  or  meetings  cover  these  items  in  one  way  or 
another?  Yes,  they  do,  but  SPIE  has  created  something  unique.  First,  the  subject 
matter  is  not  limited  to  optical  signal  processing,  as  the  name  of  the  society  might 
suggest, but includes digital and hybrid techniques as well. Second, papers that
cover completed and well-documented research are included together with papers
that discuss tentative laboratory results only a few months old.
It  is  these  two  items  that  I  feel  account  for  the  popularity  of  these  meetings.  SPIE 
should  be  proud  that  it  is  able  to  provide  a  forum  that  can  accommodate  such  a 
broad  scope  under  one  umbrella. 

Joel  Trimble 

Office  of  Naval  Research 
Progress on a systolic processor implementation

J.  J.  Symanski 

Naval  Ocean  Systems  Center,  San  Diego,  California  92152 


Abstract 

Parallel  algorithms  using  systolic  and  wavefront  processors  have  been  proposed  for  a  number  of  matrix  operations  important  for  signal 
processing;  namely,  matrix-vector  multiplication,  matrix  multiplication/addition,  linear  equation  solution,  least  squares  solution  via  orthogonal 
triangular  factorization,  and  singular  value  decomposition. 

In  principle,  such  systolic  and  wavefront  processors  should  greatly  facilitate  the  application  of  VLSI/VHSIC  technology  to  real-time 
signal  processing  by  providing  modular  parallelism  and  regularity  of  design  while  requiring  only  local  interconnects  and  simple  timing. 

In  order  to  validate  proposed  architectures  and  algorithms,  a  two-dimensional  systolic  array  testbed  has  been  designed  and  fabricated. 
The  array  has  programmable  processing  elements,  is  dynamically  reconfigurable,  and  will  perform  16-bit  and  32-bit  integer  and  32-bit  floating 
point computations. The array will be used to test and evaluate algorithms and data paths for future implementation in VLSI/VHSIC technology.

This  paper  gives  a  brief  system  overview,  a  description  of  the  array  hardware,  and  an  explanation  of  control  and  data  paths  in  the  array. 
The  software  system  and  a  matrix  multiplication  operation  are  also  presented. 

Introduction 

The systolic concept exploits the inherent high throughput and simplicity offered by a lattice of identical processing elements, all
operating in parallel on data flowing through the structure. The prospects for fabricating an array element (or several elements) on a
single  chip  appear  very  good.  However,  many  details  of  the  algorithms,  data  flow,  control,  input/output,  numerical  accuracy,  speed,  etc.  have 
to  be  determined  before  a  particular  chip  design  can  be  undertaken. 

The goal of this work is to build a systolic array testbed which is flexible enough to allow experimentation with algorithms and configurations
so that intelligent decisions can be made when it comes time to specify the chip architecture for a particular set of applications.
The  testbed  was  not  designed  for  optimum  architecture  for  a  specific  application,  but  for  flexibility  in  evaluation  of  various  algorithms  and 
architectures  which  may  be  implemented  in  the  future.  There  are  input/output  limitations  due  to  the  single  I/O  channel  from  the  host  system 
(a  common  problem  for  array  processors)  as  well  as  the  speed  limitation  set  by  the  relatively  slow  arithmetic  processing  unit. 

System  Overview 

The  systolic  array  testbed  system  is  composed  of  a  minicomputer  system  interfaced  to  the  array  of  systolic  processor  elements  (SPEs). 
The  host  is  a  minicomputer  with  the  usual  complement  of  printer,  disc  storage,  keyboard-CRT,  etc.  The  systolic  array  is  housed  in  a  cabinet 
approximately  28  by  19  by  21  inches.  (See  Figure  1.)  The  interface  circuitry  uses  a  single  16-bit  data  path  from  the  host  minicomputer  to 
communicate  data  and  commands  to  the  array. 

Commands and data are generated in the host by the operator, using interface programs written in FORTRAN or PASCAL. Algorithms
can  be  conceived,  put  into  a  series  of  commands  for  the  systolic  array  processor,  and  tested  for  validity.  Data  computed  in  the  array  can  be 
read  by  the  host  minicomputer  and  displayed  for  the  operator. 

Many other papers have discussed the theoretical aspects of the systolic array's communication of data and other properties (see references 3, 4, and 5).
We  have  implemented  the  original  H.  T.  Kung  hexagonal  interconnect  architecture,  as  shown  in  reference  2.  By  substitution  of  squares  for 
hexagons,  appropriate  rotation  of  the  communication  paths,  and  realignment  of  the  processors  on  a  square  grid,  we  have  the  square  array. 
Now  the  A  data  paths  are  horizontal,  the  B  paths  are  vertical,  and  the  C  paths  are  along  a  diagonal. 

In  this  testbed,  there  are  virtual  rows  and  columns  along  the  edges  of  the  array  as  shown  in  Figure  2.  These  virtual  rows  and  columns 
perform  the  interfacing  between  the  parallel  data  path  from  the  host  and  the  serial  communications  of  the  systolic  array  processing  elements. 
These virtual rows and columns can be thought of as existing on either side of the array, since the data paths are just 8-bit shift registers
connected in a circular manner.
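
The circular shift-register data path described above can be pictured with a short simulation. This is only a sketch under stated assumptions: the function name `shift_ring`, the shift direction, and the most-significant-bit-first ordering are illustrative choices and are not taken from the testbed documentation. It shows that after eight clocks each byte has moved exactly one register along the ring.

```python
def shift_ring(registers, clocks=1):
    """Clock a ring of 8-bit shift registers bit-serially.

    Assumption (not from the paper): each register shifts left, sending its
    most significant bit into the least significant bit of the next register,
    and the last register feeds the first.
    """
    regs = list(registers)
    for _ in range(clocks):
        # Capture every outgoing bit before any register is updated.
        msbs = [(r >> 7) & 1 for r in regs]
        # Register i receives the bit leaving register i-1 (index -1 wraps).
        regs = [((r << 1) & 0xFF) | msbs[i - 1] for i, r in enumerate(regs)]
    return regs


if __name__ == "__main__":
    # Eight clocks move each whole byte one position around the ring.
    print([hex(v) for v in shift_ring([0xA5, 0x3C, 0xFF], clocks=8)])
```

Because the path is a closed ring, a virtual row or column placed on one edge of the array is indistinguishable from one placed on the opposite edge, which is the point made in the text.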

To  achieve  architectural  flexibility,  the  serial  data  paths  into  the  array  pass  through  multiplexers  so  that  several  options  for  data  flow 
are  possible.  The  data  path  can  be  selected  in  real  time  by  the  host  processor.  The  various  configurations  and  their  uses  are  discussed  in 
reference  4. 

Figure  3  shows  a  block  diagram  of  the  host  interface.  Commands  and  data  for  the  array  are  sent  to  the  interface  and  interpreted  by  the 
control  logic.  Control  lines  and  clocks  are  generated  and  sent  to  the  array  as  appropriate. 

The  SPE  block  diagram  is  shown  in  Figure  4.  The  microprocessor  is  an  Intel  8031.  The  RAM  and  EPROM  are  standard  devices.  The 
Arithmetic Processing Unit (APU) is the Intel 8231. The I/O section contains four universal 8-bit shift registers. The A, B and C registers are
for  data.  The  S  register  is  used  to  control  the  operation  of  the  SPE.  (See  references  3  and  4.) 
Figure 3. Systolic array host interface block diagram.


Figure  4.  Systolic  processor  element  block  diagram. 


It is important to note that the flexibility of this testbed is obtained through the use of a microprocessor in the SPE, thus
allowing us to interchange the roles of A, B, and C at will.
Array Hardware


The  assembled  8-by-8  array,  with  the  interface  electronics,  is  shown  in  Figure  1.  The  SPEs  mount  directly  onto  the  16  x  23  inch 
motherboard  with  edgecard  connectors.  All  signals  to  the  array,  from  the  interface  circuitry  in  the  rack  below  the  array,  are  carried  by  four 
flat  cables.  The  power  supplies  are  mounted  behind  the  array.  The  interface  circuitry  consists  of  four  wire-wrap  boards  with  about  60  TTL 
ICs  on  each  board. 

Layout  of  circuitry  can  be  a  problem  in  printed  circuit  board  design  as  well  as  VLSI,  but  is  usually  not  as  critical.  Also,  it  is  always 
good  practice  to  make  the  layout  as  regular  and  logical  as  possible.  This  is  one  of  the  strong  points  of  systolic  arrays;  i.e.,  the  regularity  of 
interconnections. Figure 5 shows a 2-by-2 portion of the array. Note the regularity of the pattern. There are 120 signal lines from the interface
circuitry to the array. Signal lines were made as wide as possible and ground planes were used generously. Two ground returns are used,
one  for  power  and  one  for  signal. 

The  SPE  board  is  shown  in  Figure  6.  It  is  a  four  layer  board.  The  two  internal  layers  are  used  for  power  and  ground  only.  Here  again, 
there  are  two  ground  patterns,  one  for  the  return  of  power  and  one  for  signal  grounding  and  noise  shielding. 

There  are  18  chips  on  the  SPE  board.  The  most  expensive  parts,  the  Arithmetic  Processor  ($100),  the  8051  microprocessor  ($18),  and 
the  2732  EPROM  ($12),  are  on  sockets.  The  other  ICs  are  standard  TTL  devices  ($22  total).  The  printed  circuit  board  costs  about  $25,  not 
including the cost of layout. Total cost for the SPE components is about $190.

Assembly  of  the  system  has  had  the  usual  problems  which  show  up  when  putting  together  a  complex  system.  The  SPE  was  fabricated 
in wire-wrap form in order to check its operation and develop code for the 8051 microprocessor. This was done on a separate system independent
of the array host and interface system. Then the array interface was assembled and interfaced to the host. Software was used to
verify the interface operation before SPEs were connected to the array. Then a wire-wrapped 3-by-3 array of SPEs (printed circuit board prototypes)
was connected to the interface electronics. Software was again used to verify operation of the whole system. Finally the full 8-by-8
array was assembled. Software programs were generated to test the full array. The array is now fully functional. A matrix multiply implemented
on the system will be described later.

Control  Flow 

The individual processors are identical with respect to hardware and program in EPROM. Referring to Figure 4, the A, B, and C
registers  are  used  for  data.  The  S  register  is  used  by  the  host  to  store  a  value  (8  bits)  which  is  interpreted  by  the  SPE's  microprocessor  as  a 
command  to  perform  a  certain  operation.  Examples  of  operations  are:  move  the  byte  in  the  A  register  into  memory,  multiply  the  stored 
values of A and B, change shift direction of the registers, etc. (There are presently about 120 commands for data movement and computations
programmed  into  the  EPROM,  utilizing  about  half  of  the  4K  bytes  available.)  These  S  registers  are  connected  to  the  interface  as  shown  in 
Figure 7 for a 3-by-3 array. Each row of the array has an S register within the interface. The register for a given row broadcasts the same
byte to each S register in that row. The eight S registers within the interface logic can be loaded with the same or different command bytes.
Thus, loading rows with different commands will cause different operations in SPEs in different rows.

Figure 5. Two-by-two section of array backplane.

Figure 6. SPE board.

Figure 7. Systolic array control structure.


There  is  a  further  possibility  for  control  of  individual  SPEs  called  the  processor  select.  On  each  SPE  there  are  four  lines  which  control 
the input clock to the S register. Two lines determine the Select Code which controls the 4-to-1 multiplexer which selects one of four inputs.
The other two are Row Enable and Column Enable. Table 1 shows the selection possibilities. Select code 3 ANDs the Row Enable and
Column Enable inputs to a SPE. Thus, by using the Row and Column select mode, any individual SPE can be selected for an operation different
from the operation performed by all the other SPEs.

Note  that  there  is  also  an  output  line  from  the  SPE  to  the  registers  in  the  interface  logic.  These  lines  are  open  collector  driven  to  avoid 
conflicts.  With  appropriate  control,  this  can  be  used  to  pass  error  status  or  other  information  back  to  the  controller  and  then  to  the  host. 

The  S  register  can  also  be  used  for  other  purposes.  For  instance,  all  the  S  registers  in  the  interface  can  be  loaded  with  the  same  8-bit 
data.  With  the  appropriate  routine  in  EPROM,  the  same  data  can  be  broadcast  to  the  whole  array  instead  of  shifting  in  the  data  through  each 
SPE.  This  method  can  also  be  used  to  broadcast  new  routines  to  be  stored  in  the  SPE  RAM  for  testing.  Furthermore,  by  loading  each  S 
register  with  different  values,  data  can  be  broadcast  to  all  SPEs  along  a  particular  row. 
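
The row broadcast and the Row/Column select mode described above can be sketched as follows. This is an illustrative model only: the function names and the command byte values are hypothetical, and only select code 3 (Row Enable ANDed with Column Enable) is modelled, since it is the only code whose behavior is stated in the text.

```python
N = 8  # the testbed array is 8-by-8


def broadcast_rows(row_commands):
    """Each interface S register broadcasts its byte to every SPE in its row."""
    if len(row_commands) != N:
        raise ValueError("one command byte per row is required")
    return [[cmd & 0xFF for _ in range(N)] for cmd in row_commands]


def select_single(spe_s, row, col, command):
    """Model select code 3: an SPE clocks in the new command only when both
    its Row Enable and its Column Enable inputs are asserted."""
    for i in range(N):
        for j in range(N):
            row_enable = (i == row)
            col_enable = (j == col)
            if row_enable and col_enable:
                spe_s[i][j] = command & 0xFF
    return spe_s


if __name__ == "__main__":
    # Hypothetical command bytes: 0x10 for every row, 0x2A for one SPE.
    s = broadcast_rows([0x10] * N)
    select_single(s, row=2, col=5, command=0x2A)
    for line in s:
        print(" ".join(f"{v:02X}" for v in line))
```

Loading the eight interface registers with different bytes gives per-row control, and the select mode narrows that to a single SPE, which matches the two levels of control described in this section.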


Data  Movement 


The  SPE  performs  calculations  using  32-  or  16-bit  values.  However,  all  data  are  moved  in  bytes  to  and  from  the  array  as  well  as  within 
the SPE. The host formats the 32-bit values into the host I/O buffer. Once the data have been transferred into the array data buffer, commands
from the host move the data from the buffer into the appropriate parallel-to-serial registers in the virtual rows and columns. Subsequent
commands shift data into the array as well as within the array. A command byte is put into the S registers of the SPEs and an interrupt is
broadcast to all the SPEs. This causes the microprocessor in the SPE to store the data byte in the appropriate storage location in the SPE
RAM.  The  register  load,  shift,  and  data  storage  cycle  is  repeated  four  times  for  each  32-bit  value.  Subsequent  instructions  from  the  host, 
using  the  control  structure  discussed  earlier,  initiate  computations  on  the  data  as  desired. 
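
The four-cycle transfer of a 32-bit value can be illustrated with a short sketch. The byte order (least significant byte first) and the function names are assumptions made for illustration; the paper states only that the load, shift, and store cycle is repeated four times per 32-bit value.

```python
def split_32bit(value):
    """Split an unsigned 32-bit value into four bytes.

    Assumption: least significant byte first. The paper does not give the order.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    return [(value >> (8 * i)) & 0xFF for i in range(4)]


def load_value(value):
    """Model the repeated register-load, shift, and store cycle."""
    spe_ram = []
    for byte in split_32bit(value):
        shift_register = byte            # byte loaded and shifted into the SPE
        spe_ram.append(shift_register)   # interrupt: microprocessor stores it
    return spe_ram


def assemble_32bit(stored_bytes):
    """Rebuild the 32-bit value from the four bytes held in SPE RAM."""
    return sum(b << (8 * i) for i, b in enumerate(stored_bytes))


if __name__ == "__main__":
    ram = load_value(0x12345678)
    print([hex(b) for b in ram], hex(assemble_32bit(ram)))
```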

As discussed in reference 4, there are several array configurations possible with this testbed, for instance: linear, square, transpose, dual
array, and broadcast configurations. These configurations can be achieved dynamically under the host control. This is done with a multiplexer
which selects one of several sources for output from the interface to the left column and also the top row of the array. Table 2 shows
the  sources  for  the  various  configurations.  Other  sources  may  be  used  in  the  future  to  enhance  the  operation  of  particular  algorithms. 


Software  System 


The  software  system  for  the  systolic  array  testbed  can  be  thought  of  in  three  parts:  (1)  data  entry  and  display,  (2)  programming  of 
the  array,  and  (3)  execution  of  the  program.  The  host  minicomputer  has  FORTRAN,  PASCAL,  and  various  library  programs  available.  The 
three  functions  are  described  below. 

The  data  entry  function  will  allow  matrices  of  numbers  to  be  generated  in  several  ways:  (1)  by  operator  entry,  (2)  by  loading  from 
magnetic tape, and (3) by generation within the host itself. The matrix data generated by one of the above methods are formatted by the
host in a manner which allows efficient, speedy transmission to the systolic array processor during program execution, and then stored on
disc for future use. A file management system is used to keep track of the various matrices. Also stored with the matrix data is a verbal
description of the properties of, or use for, that matrix.

Table 1

Table 2. Reconfiguration Multiplexer Inputs and Uses

There  will  be  a  structured  approach  to  the  programming  of  the  array  in  that  object  code  modules  can  be  no  larger  than  256  sixteen-bit 
words.  This  forces  the  programmer  to  break  up  long  operations  into  short,  easily  understood  modules.  The  use  of  "calls"  in  the  source  code 
enables  very  long  programs  to  be  run  with  short  modules. 

Programming the array is done in a manner similar to that used for the host minicomputer. The standard text editor available for writing
the FORTRAN and PASCAL programs is used to write the source code for the systolic array. This has the advantage of familiarity and
allows the systolic processor programs to be commented so that the operations performed by the array are more easily understood. Once source code
is  written,  a  compiler  program  will  generate  object  code  for  the  array,  which  is  a  file  of  16-bit  words  sent  to  the  array  as  instructions.  The 
various  programs,  or  modules,  written  for  the  systolic  array  are  also  listed  in  a  directory  for  ready  access  by  the  operator. 

The  execution  of  programs  on  the  systolic  array  is  accomplished  as  follows.  The  operator  runs  a  "main"  module  which  calls  several 
other  modules.  A  "loader"  program  gets  the  main  module  and  links  it  with  the  called  modules  as  well  as  any  data  files  required.  All  the 
instructions  are  loaded  into  a  buffer  for  output  to  the  systolic  array  or  stored  for  future  use  as  a  single  operation.  The  loader  will  build  the 
program until (1) the buffer size is exceeded, (2) the end of the program is reached, or (3) an input is required. If the program calls for input, the object
code up to that point is sent to the array (the array performs the operations) and the host inputs the results from the array. The loader continues
in this manner until the program is completed. Results received from the systolic array are stored on disc for later display and checking.
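
The loader's three stopping conditions can be sketched as a simple batching routine. This is a hedged illustration, not the testbed's loader: the `INPUT` marker, the function name `build_batches`, and the representation of object code as a flat list of words are all hypothetical.

```python
INPUT = "INPUT"  # hypothetical marker for "the program requires input here"


def build_batches(program, buffer_size):
    """Group instruction words into the batches the loader would send.

    A batch is closed when (1) the buffer is full, (2) the program ends, or
    (3) an input is required. Each batch is (words, host_reads_results).
    """
    batches, current = [], []
    for word in program:
        if word == INPUT:
            # Condition 3: send what has been built, then read results back.
            batches.append((current, True))
            current = []
            continue
        if len(current) == buffer_size:
            # Condition 1: the buffer cannot take another word.
            batches.append((current, False))
            current = []
        current.append(word)
    if current:
        # Condition 2: end of program.
        batches.append((current, False))
    return batches


if __name__ == "__main__":
    for words, read_back in build_batches([1, 2, 3, INPUT, 4, 5], buffer_size=2):
        print(words, "-> host reads results" if read_back else "")
```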

Matrix  Multiplication 


The  matrix  multiplication  implemented  uses  an  in-place  accumulation  of  partial  sums.    The  elements  of  the  array  are  skewed  during 
input  to  the  array  boundary  processors.  (See  Figure  8.)  As  the  rows  and  columns  step  through  each  SPE,  the  sum  of  products  accumulates 
in-place. 

The  A  and  B  matrices  are  generated  in  the  host  by  one  of  the  previously  discussed  methods.  The  host  then  formats  the  data  for  output 
to  the  data  buffer  in  the  array  interface.  Then  the  host  sends  commands  to  the  interface  logic  which  moves  the  appropriate  data  elements 
from  the  data  buffer  into  the  peripheral  processors  of  the  array.  As  the  program  proceeds,  commands  cause  data  in  each  SPE  to  be  multiplied 
and  summed  as  well  as  the  original  data  to  be  moved  to  the  next  SPE.  When  the  required  cycles  of  shift  data,  multiply,  and  add  have  been 
completed,  the  matrix  product  of  A  and  B  resides  in  a  third  C  matrix  which  is  simply  one  of  the  storage  locations  of  the  SPE  RAM.  (There 
are  256  such  locations  available.) 
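
The in-place accumulation scheme can be illustrated with a small cycle-by-cycle simulation. This is a sketch under assumptions: A is taken to flow left to right and B top to bottom, with row i of A and column j of B each delayed by i and j steps respectively. The actual skew and directions used in the testbed are those of Figure 8 and may differ; the function name is illustrative.

```python
def systolic_matmul(A, B):
    """Simulate an n-by-n array in which each SPE accumulates C[i][j] in place."""
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]  # A operand currently held in each SPE
    b_reg = [[0] * n for _ in range(n)]  # B operand currently held in each SPE
    c = [[0] * n for _ in range(n)]      # accumulators, one per SPE

    for t in range(3 * n - 2):           # steps until the last operands meet
        # Shift A one SPE to the right and B one SPE down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the skewed inputs at the boundary (zeros outside the matrices).
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Every SPE multiplies its two operands and adds to its own sum.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_reg[i][j] * b_reg[i][j]
    return c


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

With this skew, SPE (i, j) sees A[i][k] and B[k][j] together at step i + j + k, so each product lands in the correct accumulator without any movement of the partial sums.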

The programming of an 8-by-8 matrix multiplication is relatively straightforward. However, other operations such as least squares and
eigensystem  problems,  especially  in  partitioned  matrix  form  (when  dealing  with  matrices  larger  than  the  array)  may  be  quite  difficult.  That  is, 
the  selection  of  data  elements  for  input  to  the  array  and  movement  within  the  array  will  require  great  care.  Software  tools  which  show  the 
data  residing  in  each  SPE  at  a  particular  point  in  the  algorithm  will  be  essential. 

In  future  systems  this  data  formatting  and  movement  will  utilize  a  high  speed  controller  with  routines  in  ROM  which  will  quickly 
select  and  move  data  in  and  out  of  the  array  for  a  variety  of  algorithms.  Or,  more  advanced  architectures,  with  greater  parallelism  in  I/O,  will 
be  developed  to  alleviate  the  I/O  bottleneck  which  can  degrade  the  operation  of  array  processors. 

Conclusions 

A  versatile  two  dimensional  systolic  array  testbed  has  been  designed,  fabricated  and  programmed  to  perform  matrix  operations.  While 
not  an  optimum  architecture  for  real-world  applications,  this  testbed  will  allow  experimentation  with  many  array  configurations  and 
algorithms.  The  experience  gained  in  implementing  a  wide  variety  of  algorithms  will  have  benefits  in  both  algorithm  and  hardware  design.  In 
the  process  of  trying  to  create  efficient,  fast,  understandable  algorithms,  we  will  also  find  the  capabilities  that  the  hardware  should  have  in 
data paths, communication of data and commands, processing power, memory, etc. This experience will enable the timely marriage of optimized
algorithms and hardware for solving complicated signal-processing problems with VLSI. In 2 or 3 years, a single chip, instead of the
present  printed  circuit  board,  could  be  designed  to  communicate  in  32-bit  numbers  and  perform  32-bit  floating-point  mathematical  operations 
in  a  few  microseconds  instead  of  the  200-300  microseconds  required  by  the  present  implementation.  Control  and  input/output  concepts  will 
also  be  required  for  efficient  utilization  of  VLSI  chips  in  a  broad  range  of  applications.  This  testbed  will  provide  insight  into  many  of  these 
problems. 
Figure 8. Multiplication of full matrices by an engagement processor.
Introduction

Between the conception of a real time signal processor and its functional VLSI realization
there is an enormous amount of effort devoted to designing, revising, optimizing and
testing. Since the process is cumulative (later work builds on previous work) and
since the activity becomes progressively more detailed, more constrained and more exacting,
it follows that the global design parameters should be fully explored. Global design
decisions, when correct, can have a greater effect on performance than many local optimizations.
When the decisions are wrong, they can cause continual difficulty. Accordingly, we
propose a design methodology based on the Configurable, Highly Parallel (CHiP) architecture
family¹ that focuses on exploring global design parameters and is especially well suited to
the VLSI implementation of signal processing systems.

The  characteristic  that  distinguishes  digital  signal  processing  design  problems  from 
other  large  VLSI  design  problems,   e.g.,  microprocessor  design,   is  that  the  former  tend  to 
require  the  assembly  of  a  large  number  of  identical  components  while  the  latter  often 
require  the  assembly  of  a  diverse  collection  of  components.     In  terms  of  the  widely 
discussed hierarchical design methodology²⁻⁴, this distinction means that signal processors
are  characterized  by  a  shallow  hierarchy  rather  than  a  deep  hierarchy.     The  emphasis  on 
decomposition  in  the  hierarchical  design  methodology,  with  its  resulting  deep  hierarchy, 
provides  less  leverage  for  signal  processing  design  problems.     Our  CHiP  computer  methodology, 
though  hierarchical,  emphasizes  the  layout  of  homogeneous  components  and  should  provide 
greater  leverage  for  signal  processor  design  situations. 

The  methodology  is  not  a  cookbook  procedure.     That  is,   there  is  not  a  sequence  of 
definite  steps  which  if  followed  from  start  to  finish  result  in  a  real  time  signal 
processor.     But  there  are  steps:     the  designer  programs  the  algorithm  for  a  CHiP  computer, 
tests  it,   assesses  the  design,  revises  it,  programs  the  subparts,   tests  them,  assesses 
their design, revises them and, finally, specializes the entire system for silicon
implementation.

In  order  to  organize  our  presentation  of  the  methodology,  we  will  develop  a  design  as  a 
running  example.     Our  problem  will  be  to  design  a  pipelined,  eight  point  Fast  Fourier 
Transform processor. The reader need not be acquainted with the FFT, since our intent is
not  to  produce  a  practical  device.     Rather  we  are  using  the  problem  as  a  context  in  which 
to  focus  on  the  design  activity. 

Problem  statement 

Naturally, the first step in any design situation is to understand the problem. For our
running FFT example this can be conveniently stated with a schematic diagram (Figure 1).
Each processing element takes two inputs, B and B', and computes two weighted sums,
B + QB' and B - QB'. (See Stone⁵ for exact details.) Our assumptions are that the
processors  receive  data  bit-serially  from  off  the  chip,   that  the  structure  is  pipelined  and 
that  the  resulting  circuit  is  to  be  placed  on  a  single  chip.     From  these  assumptions,  we 
conclude  that  we  will  need  to  place  twelve  processors  each  capable  of  multiplying  by  a 
constant  and  adding,   and  that  the  chip  will  require  sixteen  pins  for  data  in  addition  to 
power,   ground  and  any  control  lines. 
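
The processor and pin counts quoted above follow from simple counting, sketched here. The function name `fft_resources` is illustrative, and the sketch assumes a radix-2 structure with N/2 two-input processing elements in each of log2 N ranks, plus one bit-serial pin per input and one per output.

```python
def fft_resources(n):
    """Return (ranks, processors, data_pins) for a pipelined n-point FFT.

    Assumes n is a power of two, n/2 two-input processing elements per rank,
    and one bit-serial data pin for each input and each output.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    ranks = n.bit_length() - 1      # log2(n)
    processors = (n // 2) * ranks   # n/2 elements in every rank
    data_pins = 2 * n               # n inputs plus n outputs
    return ranks, processors, data_pins


if __name__ == "__main__":
    print(fft_resources(8))
```

For the eight point example this gives three ranks of four elements, that is, twelve processors, and sixteen data pins.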

Programming  the  algorithm 

The  next  step  in  the  methodology  is  to  program  the  algorithm  for  a  CHiP  computer.  The 
purpose  is  to  establish  an  unambiguous  specification  of  the  problem  and  to  begin  initial 
exploration  of  the  layout,   timing  and  input/output  constraints.     Before  programming  our  FFT 
example,  we  must  introduce  CHiP  machines. 


The  work  described  herein  is  part  of  the  Blue  CHiP  Project  and  is  supported  in  part  by  the 
Office of Naval Research Contracts N00014-80-K-0816 and N00014-81-K-0360. The latter is
Special  Research  Opportunities  Task  SRO-100. 
Figure 1. Pipelined FFT schematic.


A  CHiP  computer  is  one  of  a  family  of  architectures  specialized  for  "fine-grained" 
parallelism  and  efficient  VLSI  implementation.     The  main  component  of  the  architecture  (and 
the  only  one  of  interest  here*)    is  the  switch  lattice.     This  is  a  homogeneous  array  of 
programmable  switches  and  data  paths  with  processing  elements  placed  at  regular  intervals. 
Figure  2  illustrates  schematic  diagrams  of  two  switch  lattices.     The  switches  and  data 
paths  are  a  general  means  of  specifying  information  flow  and  the  processing  elements  serve 
to  represent  some  arbitrary  computational  activity. 


Ultimately,  when  the  methodology  has  been  worked  through  and  the  design  is  completed, 
the  switches  will  have  been  removed,  the  active  data  paths  will  have  been  replaced  by  wires 
and  the  processing  elements  will  have  been  replaced  by  specialized  circuits  for  the 
particular  function.     But  at  this  point,   this  stylized  representation  of  the  components 
gives  the  designer  a  simple,   flexible  means  of  simulating  the  algorithm.     The  simplicity 
and flexibility make the revision a less painful process and encourage exploration and
experimentation.

As  Figure  2  illustrates,   switch  lattices  differ  in  several  respects.     Although  the 
designer  will  choose  a  lattice  that  is  suitable  for  the  particular  algorithm,   it  is 
appropriate to mention the axes of variability. The degree, d, of switches and processing
elements refers to the number of data paths incident to the device. Normally, we will have
d=8 for switches and PEs although a higher degree for PEs may be convenient when there are
multiple inputs and outputs. (See below.) In Figure 2(a), d=8 and in Figure 2(b), d=4.

*Other, more thorough descriptions of the CHiP machine have been given, but they focus on
its use as a general purpose parallel processor¹,⁶. Our description here has been
specialized to its use as a design abstraction.

The  corridor  width  w  refers  to  the  number  of  switches  separating  two  neighboring  PEs. 
(In Figure 2(a), w=1; in 2(b), w=2.) The more distinct data paths that must pass between
two  processing  elements,   the  wider  the  corridor  width  must  be.     Since  the  switches  will 
ultimately  be  removed,   there  is  no  harm  in  specifying  a  large  corridor  width.     However,  by 
calling  explicit  attention  to  corridor  width,  we  cause  the  designer  to  focus  on  data  routing 
and  to  appreciate  the  consequences  of  haphazard  routing  on  density  and  packing.     Notice  that 
the  corridor  width  is  related  to  the  number  of  distinct  data  paths  passing  between  two  PEs, 
not  to  the  number  of  wires  in  each  data  path   (which  is  set  later) . 

One  programs  the  switch  lattice  simulator  by  giving     "configuration  settings"  for  the 
switches  and  program  text  for  the  PEs.     A  configuration  setting  specifies  which  of  the 
incident  data  paths  a  switch  is  to  connect.     If  no  configuration  setting  is  given  the  data 
paths  are  isolated.     In  the  figures  we  simply  draw  lines  through  switches  to  specify  active 
settings.     The  program  text  is  given  in  a  conventional  sequential  programming  language  that 
has  been  extended  with  facilities  to  specify  timing.* 

Returning  to  our  FFT  example,  we  can  specify  our  first  embedding.     Figure  3  illustrates 
a direct embedding of the FFT interconnection (Figure 1) in a switch lattice where w=2 and
d=8. Because of the number of data paths crossing from the upper half of the layout to the
lower  half,   a  width  w=2  is  required.     Notice  that  the  layout  is  the  same  for  each  of  the 
three  files. 


Figure  3.     Switch  lattice  embedding  of  the  FFT. 

The  execution  of  the  CHiP  computer  is  synchronous,  so  the  development  of  the  PE  code  is 
a  simple  matter.     Each  PE  executes  a  variant  of 

L:  READ B; READ B'
    C  <- B + QB'
    C' <- B - QB'
    WRITE C; WRITE C'
    GOTO L

where  the  variant  is  determined  by  which  PE  ports  the  variable  comes  from  or  goes  to.  For 
example,   PE  1.1  would  execute 


*For  the  Blue  CHiP  Project's  pilot  simulator,  the  language  is  Pascal. 
L:  READ B FROM West; READ B' FROM Southwest
    C  <- B + QB'
    C' <- B - QB'
    WRITE C TO East; WRITE C' TO Southwest
    GOTO L

PEs  with  degree  greater  than  eight  have  their  ports  numbered. 
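
As a rough illustration of what the PE code computes, the following Python sketch applies the same step, C = B + QB' and C' = B - QB', file by file to form an 8-point transform. The names `butterfly` and `fft_by_files`, the bit-reversed input ordering and the choice of the coefficients Q are assumptions made for this sketch; the interconnection of Figure 1 may order the data differently.

```python
import cmath


def butterfly(b, b_prime, q):
    """One PE step: return (B + Q*B', B - Q*B')."""
    t = q * b_prime
    return b + t, b - t


def fft_by_files(x):
    """Radix-2 FFT in which each pass of the outer loop plays the role of one file of PEs."""
    n = len(x)
    stages = n.bit_length() - 1
    assert 1 << stages == n, "length must be a power of two"
    # Assumed input ordering: bit-reversed indices.
    data = [x[int(format(i, f"0{stages}b")[::-1], 2)] for i in range(n)]
    half = 1
    while half < n:  # one iteration per file
        for start in range(0, n, 2 * half):
            for j in range(half):
                q = cmath.exp(-2j * cmath.pi * j / (2 * half))  # coefficient Q
                i0, i1 = start + j, start + j + half
                data[i0], data[i1] = butterfly(data[i0], data[i1], q)
        half *= 2
    return data
```

For eight inputs the outer loop runs three times, matching the three files of four PEs in the example.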

Although  the  development  of  the  program  is  the  responsibility  of  the  designer,  there  are 
library  embeddings  available  that  embody  careful  analysis  and  research. 

Assessment  and  revision 

The  next  activity  in  the  methodology  is  to  assess  the  initial  design  and  make  appropriate 
improvements.     The  goal  here  is  to  evaluate  how  the  design  can  be  globally  improved  before 
investing  any  effort  in  the  detailed  layout.     Obviously,   this  activity  will  require  a 
certain  amount  of  judgement  and  experience. 

Our  FFT  has  several  favorable  characteristics.     It  has  a  nearly  square  aspect  ratio  (4:3) 
and  has  edge-to-edge  data  flow.     The  latter  property  is  important  in  order  to  reach  the 
bonding  pads  which  are  most  conveniently  located  on  the  perimeter.     The  main  liability  of 
our  initial  design  is  the  nonlocal  data  flow,   i.e.,   the  presence  of  long  data  paths.  When 
the  design  is  laid  out,   some  wires  will  have  to  be  as  long  as  the  side  of  a  PE. 


To solve this long data path problem, we observe that to achieve edge-to-edge data flow,
it is not necessary for the flow to be unidirectional as it is in our initial design. In
particular, an alternative strategy is to route the data towards the center of the layout
and then back out towards the perimeter. To achieve such an in-and-out data flow, we place
the second file (2.x) of processing elements in the center of the layout and place the first
and third files around the edge. Figure 4 illustrates this layout. The result is a design
which still has edge-to-edge data flow and short, local connections. (This particular
optimization may not generalize for larger shuffle graph problems, but the concept of
in-and-out, edge-to-edge data flow could have wide application.)

Figure 4. A revised FFT embedding with local communication.


The  assessment  and  revision  activity  is  iterated. 

In  the  second  design  the  aspect  ratio  is  now  square  —  a  minor  improvement.  Unfortunately, 
the  corners  of  the  layout  are  unused.     This  area  can  be  used  for  bonding  pads  for  the 
input/output  wires  of  the  adjacent  PEs.     It  could  also  be  used  for  other  logic  depending  on 
how  the  design  develops.      (See  below.) 

When studying the way data enters and leaves the PEs in Figure 4, one sees that there are
two different processing element geometries: the external PEs are alike and the internal
PEs are alike.* It is undesirable to have to require two designs for the same function, so
we reprogram the external switches to convert the external PE geometries to [...]. This
gives one layout form. Furthermore, if we reflect [...]ly be implemented, it is clear that
since the two [...] together, the global data flow will be optimized if these entry and exit
points are on opposite corners [...] to this point to reassess and revise.

Figure 5. A reprogramming of perimeter interconnections.


Round  two 

The  process  of  programming  the  CHiP  machine  has  resulted  in  an  unambiguous  specification 
of the algorithm, a routing of the data flow, a global layout and, presumably, the development
of some test data that was used when the algorithm was run on the CHiP architecture
simulator.     But  this  first  program  is  not  intended  to  specify  the  algorithm  in  great  enough 
detail  for  direct  VLSI  design  and  layout.     In  particular,   the  functional  activity  of  the  PEs 
is  probably  too  complicated  at  this  early  stage.     In  our  FFT  example,   the  inner-product  step 
is  such  a  complicated  activity. 

So  the  methodology  dictates  that  we  iterate  the  program-assess-revise  cycle  until  the 
functional  activity  of  the  PEs  is  sufficiently  simple  to  be  directly  implemented  in  VLSI  or 
can  be  implemented  by  an  available  library  layout.     Since  the  interconnection  and  global 
layout  are  now  fixed,   it  is  necessary  only  to  implement  the  specified  activity  of  the  PE. 
This  is  accomplished  by  programming  the  CHiP  architecture  to  implement  the  algorithm 
specified  by  the  PE  code(s).     It  is  this  iterative  activity  that  gives  the  methodology  its 
hierarchical  capability. 

During  each  subsequent  round  of  programming-assessment-revision ,   it  is  important  to 
establish  that  the  current  CHiP  program  correctly  implements  the  specification  of  the 
previous  level.     This  is  a  requirement  of  any  top-down  design  effort,   and  it  is  aided  here 
by  the  previously  developed  test  data.      (Notice  that  the  test  data  may  have  to  have  its 
form  changed  to  reflect  the  changed  level  of  detail.     For  example,   at  one  level  the  program 
can  be  simulated  on  words  of  data  while  at  the  next  level  it  might  require  bit-serial  data.) 

We  return  briefly  to  the  FFT  example  to  give  a  second  level  of  layout.     Postulate  a 
linear array of PEs to perform the inner product step based on a pipelined multiplier7. The
layout will have two serial inputs, B and B', and will produce two serial outputs, B + QB'
and B - QB'. The coefficient, Q, will be stored internally to the layout, although it will
be  shifted  through  to  form  the  intermediate  products.     By  our  analysis  from  the  previous 
level,   the  current  layout  will  receive  its  input  at  one  corner  and  must  deliver  its  output 
to  the  opposite  corner.     This  suggests  a   "snaked"  arrangement  for  the  linear  array  of 


*The  internal  PEs  are  not  quite  alike  —  the  (clockwise)  meaning  of  the  data  paths  differs 
among  them.     This  will  be  easily  corrected  later  by  a  simple  wire  crossover. 
processing elements. (See Figure 6.) Each PE has as input and output the three data
values  as  well  as  the  partial  product.     The  B  value  is  carried  along  to  be  available  at  the 
end  for  summing  and  differencing  in  the  last  cell.     The  control  lines  could  either  be 
broadcast  or  transmitted  sequentially 7  from  the  control  circuit  that  we  will  place  in  the 
corners  of  the  global  design. 

Figure 6. A pipelined inner product layout.
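
The second-level inner product step can be sketched in Python as a chain of shift-and-add cells. This is only an illustration under stated assumptions: Q is taken to be an unsigned integer with a fixed number of bits, one cell is assumed per bit of Q, and the bit-serial timing and control lines of the actual layout are not modeled. The name `inner_product_step` and the parameter `bits` are hypothetical.

```python
def inner_product_step(b, b_prime, q, bits=8):
    """Return (B + Q*B', B - Q*B') using one shift-and-add cell per bit of Q."""
    if not 0 <= q < (1 << bits):
        raise ValueError("q must fit in the assumed number of cells")
    partial = 0                      # partial product passed from cell to cell
    for cell in range(bits):         # cell i holds bit i of the coefficient Q
        if (q >> cell) & 1:
            partial += b_prime << cell
    # B has been carried along unchanged; the last cell sums and differences.
    return b + partial, b - partial
```

The loop corresponds to the linear array of multiplier cells, and the final line to the sum/difference cell at the end of the snake.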


As  before  we  should  next  program  the  activity  of  the  PEs.     This  time  there  will  be  a  few 
different  cells  since  the  multiplier  requires  a  few7  and  there  is  a  sum/difference  cell. 
Then  we  embark  on  another  sequence  of  assess  and  revise  iterations.     Having  illustrated  how 
the   "opposite  corner"  data  flow  property  established  at  the  top  level  becomes  a  constraint 
to  be  implemented  at  the  second  level,  we  forego  further  detailed  design. 

Design  specialization 

The  program-assess-revise  cycle  continues  until  the  processing  performed  at  each  PE  can 
be  directly  implemented  as  a  VLSI  design.     The  needed  cells  are  either  produced  or  acquired 
from  a  library.     Then  the  design  is  specialized .     That  is,   the  VLSI  designs  replace  the  PEs 
in  the  last  CHiP  program  layout.     The  active  data  paths  are  replaced  by  wires  and  all  of  the 
switches  are  removed.     This  result  is  then  used  to  specialize  the  next  higher  level  program, 
i.e.,   it  replaces  the  PEs  in  its  predecessor  layout,  etc.     When  the  activity  is  completed, 
our  stylized  CHiP  lattice  is  gone  and  what  remains  is  a  completed  VLSI  design.     For  our 
example,   see  the  schematic  in  Figure  7. 

Although it is straightforward, the specialization process is not quite as trivial as
just  suggested.     Its  success  depends  on  several  conditions.     First,   the  aspect  ratios  and 
cell  sizes  must  be  properly  controlled  during  the  design  process  in  order  to  pack  the  cells 
easily.     This  condition  is  easily  met  as  long  as  the  PEs  perform  closely  related  operations. 
In  our  running  example,   the  top  level  cells  were  identical;   the  second  level  cells  were 
sufficiently  similar  to  justify  an  assumption  of  equal  size. 

Another  complication  for  specialization  is  power  and  ground  routing.     We  recommend  the 
following  strategy.     Perform  the  routing  prior  to  specialization  but  after  all  the  VLSI 
cells  are  designed.     At  that  point  it  is  known,   relatively,  where  power  and  ground  enter 
the  cells.     Then,   route  the  power  and  ground  wires  within  each  CHiP  lattice  layout  starting 
at  the  top  level.     This  permits  a  convenient  top-down  routing  with  the  added  advantage  of 
knowing  the  target  sites  for  the  bottom  level  connections. 

A  word  about  simulation.     As  the  program-assess-revise  cycle  is  performed,  each  program 
can  be  simulated  in  isolation  using  the  data   (possibly  revised)    from  the  previous  level. 
Moreover, the composite design can be simulated at each cycle by logically substituting the
programs  of  each  level  for  the  PEs  of  the  previous  level.     Once  the  PEs  have  been  replaced 
by  VLSI  cells,  however,   it  is  unclear  to  what  extent  the  design  methodology  can  assist  in 
efficient  simulation.     It  is  obviously  compatible  with  hierarchically-based  VLSI  design 
rule  checking8   and  electrical  integrity  checking9. 

Summary  and  discussion 

The methodology we have presented focuses on global design issues of a VLSI implementation:
data flow, functional decomposition, geometric layout of components. If we use '+'
to denote 'one or more applications of', then the CHiP architecture methodology could be
described as

(program, test (assess, revise, test)+)+ specialize

SPIE Vol. 341 Real Time Signal Processing V (1982)

Figure 7. The specialized layout.

This  methodology  leads  to  a  design  with  a  shallow  hierarchy,  making  it  most  effective  for 
highly  regular  algorithms  such  as  digital  signal  processing  systems. 

The  CHiP  architecture  is  crucial  to  the  methodology.     The  switch  lattice  provides  a 
medium that mirrors raw silicon: it is planar; it has integrated processing and interconnection
facilities; it is described geometrically; external data is available only at the
perimeter.     Consequently,  programming  an  algorithm  for  a  CHiP  architecture,   though  reasonably 
convenient,  gives  a  good  approximation  to  a  VLSI  layout. 

It  is  this  feature,  a  convenient  programming  abstraction  imposing  VLSI-like  constraints, 
that  perhaps  most  distinguishes  the  CHiP  methodology  from  others  in  which  the  specification 
form  is  divorced  from  the  technology. 

Related  results 
There  are  three  points  to  be  made  about  related  research. 

First,   from  our  study  of  configuration  settings  we  have  developed  a  library  of  efficient 
embeddings  for  commonly  used  interconnection  structures.     These  include  single  corridor, 
planar,   linear  area  binary  trees1'10,  toruses  with  no  long  data  paths10,  shuffle-exchange 
graphs  with  narrow  corridors,  etc.     For  example,   Figure  8  shows  a  64  node  shuffle-exchange 
graph embedded in a lattice with w=1 and d=8. This embedding, due to Paul Morrissett11, is
of  interest  because,   in  general,   the  shuffle-exchange  graph  requires  very  wide  corridors6. 
In  addition,   there  are  general  embedding  techniques  known  for  common  layout  problems:  the 
Aleliunas-Rosenberg  technique  for  bending  data  paths  around  corners1,   and  lacing  for 
maximizing  the  number  of  data  paths  through  a  region  of  the  graph10. 

Second,  we  have  developed  another  methodology,  called  Processor  Displacement,  that 
assists  the  designer  in  balancing  pin  limitations  with  chip  area  utilization12.  This 
approach  to  determining  the  optimal  amount  of  multiplexing  is  compatible  with  the  CHiP 
architecture  methodology  described  here. 

Third, the CHiP computer is intended to be a general purpose parallel processor and as
such it physically implements a switch lattice with programmable switches and microprocessors
as processing elements1. Were CHiP computers generally available, a signal
processing system could be built simply by running the top level program of our methodology.
This solution to constructing a special purpose signal processor probably would not have
sufficiently good performance to serve most applications. Although easily accomplished,
this would be too general a solution for a high performance device. Our methodology, on the
other hand, can lead to high performance but requires much effort. There could be a
compromise solution: we are exploring the possibility of a semispecialized CHiP computer
which would replace the general purpose microprocessor PEs with functional units tailored
to a specific application. CORDIC processors are good candidates for these specialized
PEs13.

Figure 8. A 64 node shuffle-exchange graph.
Abstract

An emerging belief among many researchers is that a significant portion of the next generation of high performance computers
will be based on architectures capable of exploiting very large scale integration (VLSI) modules. In particular, it is desirable to have
a compact system that can be plugged in with interchangeable high performance modules to fit various application requirements.
The system can be an efficient signal processor when special purpose signal processing modules are used; it can also be an efficient
database machine when the modules are replaced with data processing modules. This paper discusses some of the issues in the
design of such a system, and describes the framework of a system that is being developed at CMU.

1. Introduction

Tremendous progress has been made in recent years on research concerning special-purpose VLSI designs. Many algorithms
and architectures that are well suited for VLSI implementation have been proposed, and some of them have already been
implemented in chips. More importantly, useful design methodologies for special-purpose systems have evolved. For
example, using the systolic approach, many important functional modules in application areas such as signal and image
processing, as well as in database processing, can be mapped into silicon cost-effectively (see the recent expository paper). From
these results and other advances, such as fast FFT machines and chips that are becoming commercially available, we feel that
design and implementation costs of special purpose devices are no longer a major difficulty. A pressing problem now is that of
system issues such as convenient incorporation of these devices into complete systems, and their effective utilization from a system
point of view.

2. System Issues Illustrated by Examples in Application

In this section we identify relevant system issues by studying some concrete examples. To illustrate the issues it suffices to
consider these examples separately, although in practice several of them may actually be "chained" together in one single
application. For instance, some signal processing applications require that a matrix multiplication be performed following an FFT
computation. At the end of this section, we will briefly discuss the "chaining" problem.

2.1.  Matrix  Multiplication 

Suppose that we want to build a system that can incorporate special-purpose devices for performing matrix multiplications.
Any such device at the hardware level can only multiply matrices of some fixed orders. Therefore, for a given problem of
arbitrary size, we must decompose it into subproblems so that each of these subproblems can be solved by one of these
special-purpose devices. In the following we assume that the given matrices are $A = (a_{ij})$ and $B = (b_{ij})$, both $n \times n$, and that we want
to compute their product $C = (c_{ij})$. For illustration the only special-purpose device that we will consider is a straightforward
one-dimensional systolic array described below.

One-dimensional systolic array for matrix multiplication. The array consists of k cells, each capable of performing a
multiply-accumulation operation, as depicted in Figure 1. We assume that k is a constant much less than n. For $j = 1, 2, \ldots, k$,
the $j$-th cell from the left computes the inner product of the two vectors $(a_{i1}, a_{i2}, \ldots, a_{in})$ and $(b_{1j}, b_{2j}, \ldots, b_{nj})$.
Thus the whole array computes the product of the row vector $(a_{i1}, a_{i2}, \ldots, a_{in})$ and the matrix $B'$ consisting of the first k
columns of the matrix B. By pumping all the row vectors in A into the array one after another and by recirculating
$(b_{1j}, b_{2j}, \ldots, b_{nj})$ around cell j for each j, matrix multiplication $C' = A \times B'$ is performed.

Three levels of I/O requirements. Since the systolic array above can handle k columns of matrix B in one pass, to use it to
multiply A and B it is natural to decompose B into submatrices $B_1, B_2, \ldots, B_h$, where $h = \lceil n/k \rceil$, each $B_j$ has k columns
of B for $j = 1, \ldots, h-1$, and $B_h$ may have fewer than k columns of B. The product C of A and B is decomposed similarly into
submatrices $C_1, C_2, \ldots, C_h$. The systolic array computes $C_1 = A \times B_1$ first, $C_2 = A \times B_2$ next, and so on. One can check that
each element in A enters the systolic array h times, whereas each element in B enters the array n times. Elements in C never have
to be input to the array, and they are output only after their final values have been accumulated. Thus the I/O requirement for B is
the highest, that for A is the second and that for C is the lowest. This suggests the use of hierarchical memories with different
sizes and access rates to store the three matrices.

Figure 1: One-dimensional systolic matrix multiplication array.

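
The I/O counts claimed for this decomposition can be checked with a small Python sketch. It is a software model only: `systolic_matmul` is a hypothetical name, the cells are simulated by ordinary loops, and "entering the array" is modeled simply as one counted read per use.

```python
from math import ceil


def systolic_matmul(A, B, k):
    """Multiply n-by-n matrices with a simulated k-cell array, counting operand inputs."""
    n = len(A)
    h = ceil(n / k)                      # number of column blocks B_1 .. B_h
    C = [[0] * n for _ in range(n)]
    a_inputs = 0
    b_inputs = 0
    for block in range(h):
        cols = range(block * k, min((block + 1) * k, n))
        for i in range(n):               # pump row i of A through the array
            a_inputs += n                # each a[i][t] enters once per pass
            for j in cols:               # the cell holding column j of B
                acc = 0
                for t in range(n):
                    acc += A[i][t] * B[t][j]
                    b_inputs += 1        # b[t][j] recirculates into its cell
                C[i][j] = acc            # output only the final value
    return C, a_inputs, b_inputs
```

Dividing the two counters by n*n gives the per-element figures: h for A and n for B, while no element of C is ever read back in.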
2.2.  2-D  Convolution 

One of the most compute-intensive tasks in image processing is the two-dimensional (2-D) convolution problem. Consider the
problem of convolving an $m \times m$ kernel with an $n \times n$ image, where $m < n$. Assume that m is not large enough to make an
FFT-based solution cost-effective. Then the straightforward method of solving the problem will require $O(m^2 n^2)$ operations.
Several systolic devices for the 2-D convolution problem have been recently proposed (see, e.g., 3, 4).

Problem decomposition. Suppose that such a device can only handle kernels of size $k \times k$, where $k < m$. Then the $m \times m$ kernel
has to be decomposed into subkernels of size $k \times k$. Consider the algorithm that slides the kernel from left to right until the right
boundary of the image is reached. Then slide down one row of image and do the same, until the bottom of the image is reached.
For the subkernels, first input subkernel $K_{11}$ to the 2-D convolution device, and then pass the first k rows of image through the
device; partial sums are obtained. Do the same for other subkernels, $K_{12}, \ldots, K_{1,\lceil m/k \rceil}$, and add the resulting partial sums
together. Now input subkernel $K_{21}$ to the 2-D convolution device, and pass the second k rows of the image through the device and
add the resulting partial sums. Similar operations are performed for other subkernels. The algorithm is illustrated in Figure 2.

I/O requirement. From this algorithm, it is easy to see that each kernel element enters the device $O(n)$ times, each entry of the
input image $O(m^2/k)$ times, and that of the resulting image is never input to the device. Thus the I/O requirement for the kernel
is the highest, that for the input image is the second, and that for the resulting image is the lowest.
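
The subkernel decomposition can be illustrated with a short Python sketch. It checks only that the partial sums from the subkernels add up to the full result; it does not reproduce the row-by-row data movement of the systolic device. The names `device` and `decomposed_convolution` are hypothetical, the kernel is applied without flipping (sliding-window form), and only output positions where the kernel fits entirely inside the image are computed.

```python
def device(image, sub, r0, c0, out_size):
    """Stand-in for the small device: slide `sub` over the image, offset by (r0, c0)."""
    return [[sum(sub[u][v] * image[r + r0 + u][c + c0 + v]
                 for u in range(len(sub)) for v in range(len(sub[0])))
             for c in range(out_size)]
            for r in range(out_size)]


def decomposed_convolution(image, kernel, k):
    """Apply an m-by-m kernel using only subkernels of size at most k-by-k."""
    n, m = len(image), len(kernel)
    out_size = n - m + 1
    total = [[0] * out_size for _ in range(out_size)]
    for r0 in range(0, m, k):            # row of subkernels
        for c0 in range(0, m, k):        # column of subkernels
            sub = [row[c0:c0 + k] for row in kernel[r0:r0 + k]]
            partial = device(image, sub, r0, c0, out_size)
            for r in range(out_size):    # accumulate the partial sums
                for c in range(out_size):
                    total[r][c] += partial[r][c]
    return total
```

Subkernels on the last row and column are smaller than k-by-k whenever k does not divide m; the sketch handles them by slicing rather than by zero padding.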


2.3.  FFT 


Consider the problem of computing the n-point discrete Fourier transform (DFT) by the fast Fourier transform (FFT)
algorithm. Suppose that we have a special-purpose device that can compute k-point DFTs, where k is much smaller than n.

Problem decomposition. Decomposition for the FFT is not as straightforward as that for the matrix multiplication or the 2-D
convolution problem. Figure 3 depicts an n-point FFT computation and a decomposition scheme for n = 16 and k = 4. Note that
each subcomputation block is sufficiently small so that it can be handled by the 4-point device. During execution, results of a block
must be temporarily stored and later retrieved to be combined with results of other blocks as they become available. With the
decomposition scheme shown in Figure 3 (b), the total number of I/O operations is $O(n \log n / \log k)$. In fact, it has been shown
that, to perform the n-point FFT with a device of O(k) memory, at least this many I/O operations are needed for any
decomposition scheme.

Figure 2: Decomposing the 2-D convolution problem for a device of size 4x4.

Figure 3: (a) 16-point FFT graph; (b) decomposing the FFT computation with n = 16 and k = 4.
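
A decomposition of this kind can be sketched in Python for the case n = k*k (for example n = 16, k = 4). The sketch uses a standard two-pass Cooley-Tukey factorization, which is assumed here and may differ in block ordering from the scheme drawn in Figure 3 (b). The names `dft` and `fft_with_device` are hypothetical, and `io_reads` counts only the operands fed to the simulated k-point device.

```python
import cmath


def dft(x):
    """Stand-in for the special-purpose device: a direct k-point DFT."""
    k = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / k) for t in range(k))
            for f in range(k)]


def fft_with_device(x, k):
    """n-point DFT, n = k*k, built from 2*k calls to the k-point device."""
    n = len(x)
    assert n == k * k
    io_reads = 0
    # Pass 1: one block per residue r, transforming x[r], x[r+k], x[r+2k], ...
    inner = []
    for r in range(k):
        inner.append(dft(x[r::k]))
        io_reads += k
    # Pass 2: multiply stored results by twiddle factors, then transform across r.
    X = [0] * n
    for f1 in range(k):
        block = [inner[r][f1] * cmath.exp(-2j * cmath.pi * r * f1 / n)
                 for r in range(k)]
        io_reads += k
        out = dft(block)
        for f2 in range(k):
            X[f1 + k * f2] = out[f2]
    return X, io_reads
```

For n = 16 and k = 4 the device reads 32 operands, which equals n log n / log k.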

I/O Requirements. From the discussion above, we see that the I/O requirement for the FFT computation is inversely
proportional to the logarithm of the memory size. Thus for the FFT computation it is possible to trade the size of a memory for its
speed. Figure 4 (a) depicts a scenario in which a special-purpose FFT device is supported by a two-level memory; Figure 4 (b) shows
some of the memory size and speed trade-off results.

Figure 4: (a) Two-level memory for the FFT computation; (b) memory size (M) and speed (S) trade-offs.
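
To make the trade-off concrete, one can take the I/O count quoted earlier and write the memory size M in place of k. This substitution, and the omission of constant factors, are assumptions of this worked example rather than results taken from Figure 4 (b).

$$
T_{\mathrm{I/O}}(M) = O\!\left(\frac{n \log n}{\log M}\right),
\qquad
\frac{n \log n / \log M}{n \log n / \log M^{2}} = \frac{\log M^{2}}{\log M} = 2 .
$$

Under this reading, squaring the memory size halves the number of I/O operations, so the larger memory may be roughly half as fast for the same overall throughput.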

Control requirements. A fairly nontrivial task in the use of special-purpose devices for the FFT computation is the generation of
subcomputation blocks as shown in Figure 3 (b). Inputs to such a block in general are outputs from other blocks multiplied by
some so-called "twiddle" factors. This requires shuffling bits of addresses and generating the twiddle factors. Some custom
hardware is likely needed here to keep up the speed.

2.4.  Chaining 

Chaining can often speed up computation and reduce I/O requirements by factors of 2 or more. In general different problems
need different chaining schemes. Consider for example the problem of multiplying matrix A by B with one systolic matrix
multiplication array, and then multiplying the resulting matrix, $C = A \times B$, by another given matrix D with another systolic array.
Assume that the systolic arrays are one-dimensional arrays each having k cells, as illustrated in Figure 1, and that the decomposition
scheme of Section 2.1 is used. As soon as the first k columns of C are computed by the first systolic array, they can be chained into
the second systolic array so that it can start computing the first k columns of $C \times D$. Similar chaining can be applied to other groups
of k columns. Chaining calls for provision of direct communication paths between special-purpose devices.

3.  A  Conceptual  System 

After  having  identified  some  of  the  issues  for  incorporating  special-purpose  devices  into  a  system,  here  we  outline  a  conceptual 
system  addressing  these  issues. 

3.1.  System  Block  Diagram 

The  system  block  diagram  is  shown  in  Figure  5.  Functions  of  the  blocks  are  described  in  the  following: 

Host  Processor  —  HP.  This  is  the  central  controller  for  the  system.  It  runs  the  operating  system,  and  schedules  and  monitors 
activities  of  all  components  in  the  system.  In  the  simplest  form,  it  could  just  be  a  micro-store,  loadable  with  microcodes  from  the 
outside  world.  But  in  some  applications  it  may  be  necessary  for  the  HP  to  perform  some  scalar  operations  that  can  not  be  done 
cost-effectively  elsewhere. 

Host  Memory  —  HM.  This  is  the  system  memory  that  in  theory  is  capable  of  holding  all  the  input  and  output  data  for  a  given 
problem.  For  example,  for  the  matrix  multiplication  problem  it  should  be  able  to  hold  the  two  input  matrices  and  the  resulting 
matrix.  In  practice,  for  exceedingly  large  problems  HM  should  at  least  be  large  enough  to  serve  as  a  buffer  so  that  communication 
with  the  outside  world  will  not  become  a  performance  bottleneck  for  the  system. 

Host  I/O  —  I/O.  Through  the  I/O  port,  data  and  microcodes  can  be  input  to  and  out  from  the  system. 

Intermediate  Processor  —  IP.  This  processor  serves  as  the  interface  between  the  system  and  special-purpose  devices.  It  has  a 
relatively large (maybe interleaved) memory to provide an intermediate level of memory in the memory hierarchy called for by
many applications, as discussed in Section 2. Potentially, it can also coordinate those special-purpose devices to which it interfaces
to implement closely coupled functions such as the "chaining" operation.

Figure 5: System block diagram.

Special-purpose  Processor  —  SP.  An  SP  is  a  special-purpose  or  systolic  processor  for  performing  some  high-level  functions 
such  as  matrix  multiplication  and  FFT;  it  may  have  some  private  memory  to  store  data  that  are  most  frequently  accessed.  To  the 
system SPs are interchangeable; therefore it is important that the I/O behaviors of the SPs be regular. Fortunately, this is the case
for  most  systolic  processors1. 

SP  Handler  —  SPH.  The  SPH  is  a  microprogrammable  processor;  every  SP  in  the  system  is  served  by  one  specially 
programmed  SPH  as  its  controller.  In  addition,  addresses  of  inputs  to  and  output  from  an  SP  are  generated  by  its  SPH.  These 
addresses  are  sent  to  the  IP  memory  to  which  the  SP  interfaces. 

4.  System  Programming 

We have introduced the major components in the system. This section describes how the system should be programmed to
fulfill  its  objective  of  providing  a  convenient  environment  for  the  users  to  utilize  special-purpose  devices  available  in  the  system. 

4.1.  Compilation  Phase 

During  this  phase  source  programs  submitted  by  users  are  transformed  into  programs  directly  executable  by  special-purpose 
devices.  Major  stages  of  this  phase  are  shown  in  Figure  6. 

Decomposition  stage.  As  discussed  earlier,  to  utilize  a  special-purpose  device  a  problem  of  arbitrary  size  may  have  to  be 
decomposed into subproblems of smaller sizes so that each of the subproblems can be solved directly by the special-purpose device.
Outputs from the decomposition stage are a collection of subproblems, possible data-dependency among them, and
recommendation on their execution order in order to minimize the I/O cost. To define a subproblem one has to specify where its
input data come from and where its outputs are to be stored. In general different decomposition routines are needed for different
problems or different special-purpose devices.

Scheduling  stage.  Based  on  the  outputs  from  the  decomposition  stage,  the  system  configuration,  and  system  component  status, 
subproblems are assigned to various SP's during this stage. Outputs of this stage form a sequence of instructions to the SP's.
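
The two stages can be pictured with a minimal Python sketch. Everything here is hypothetical: the `Subproblem` record, the `schedule` function and the round-robin assignment of subproblems to SPs are illustrative choices, not the system's actual data structures or scheduling policy. The sketch only shows that a subproblem names where its inputs come from and where its output goes, and that the schedule must respect data dependencies.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+


@dataclass
class Subproblem:
    name: str
    inputs: tuple          # where the input data come from
    output: str            # where the result is to be stored
    depends_on: tuple = () # names of subproblems that must finish first


def schedule(subproblems, sp_names):
    """Return (subproblem, SP) pairs in an order that respects dependencies."""
    graph = {s.name: set(s.depends_on) for s in subproblems}
    order = list(TopologicalSorter(graph).static_order())
    # Assumed policy: hand subproblems to the SPs in round-robin fashion.
    return [(name, sp_names[i % len(sp_names)]) for i, name in enumerate(order)]
```

For example, two independent block products followed by a step that combines them would be scheduled with the combining step last, whichever SP it lands on.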

4.2.  Operation  Phase 

Codes  produced  during  the  compilation  phase  are  carried  out  during  the  operation  phase.  The  system  is  first  set  up  by  sending 
initialization  parameters  to  its  components.  When  operation  phase  starts,  data  required  by  an  SP  are  sent  from  HM  to  the  IP  to 
which  the  SP  interfaces.  The  SP  retrieves  its  inputs  from  and  stores  its  outputs  to  the  IP  memory  according  to  the  addresses 
generated by the associated SPH. The HP monitors activities at or above the IP level, and is responsible for data transfers among the
IPs  and  the  HM. 

To understand system programming issues and the overall system control, at CMU we have written Lisp programs to simulate
systems that use special-purpose devices for performing computations such as matrix multiplication, 2-D convolution and FFT.
Further simulations will be performed before the system architecture is finalized. The plan is to build a VAX-based prototype host
machine with the concepts outlined in this paper in the near future.

Figure 6: Stages in the compilation phase.
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Abstract 


The harmonic analysis of binary approximations to trigonometric reference functions employed in digital
signal processing functions is described. An analytic representation of the decomposition of sine or cosine
functions into elementary rational binary functions is suggested which permits a direct solution to the problem of
calculating the harmonics of the error of approximation. For applications such as ROM-based frequency
synthesizers, angle-encoders, and discrete Fourier transformations, the approach illustrated provides a convenient
method of relating the precision of functional implementation of microelectronically integrated functions to the
harmonic content of the functional error.

Introduction 

The complexity of microelectronically integrated signal processing functions is directly related to the
allowable error in integrated functional processor responses to representative classes of input functions for choices of
algorithm, throughput, numerical method, and technique of implementation. The allowable error is usually
difficult and costly to relate to a discrete version of an algorithm to be implemented digitally since the complications
introduced by arithmetic non-linearities, or irregularities, frequently preclude extensive quantitative estimates
other than a gross mean-squared-error average. This is particularly true, for example, in the case of discrete
Fourier transformations in which digitally encoded sine functions are used for the calculation of the required
inner-products. In this paper a direct method of determining the maximum spurious harmonic components of
encoded trigonometric functions is presented. Application of the method is illustrated with a portrayal of the
spurious harmonics for the case of a 1024-point discrete Fourier transformation, DFT, employing eight- and
nine-bit representations of trigonometric functions.

Direct  Harmonic  Decomposition 

Consider that some algorithm for digitally encoding $y(\theta)=\cos\theta$ into $L$ distinct binary levels on $N$ equally
spaced intervals, $2\pi/N$, has resulted in a set of integers, $\kappa$, with which level-transitions of an $n$-bit binary
representation of $y(2\pi k/N)$, $\hat{y}(2\pi k_i/N)$, may be associated at the points $k_i\in\kappa$. The purpose of the following
development is to obtain the Fourier coefficients of $\hat{y}$ in order to evaluate directly the harmonic content of the
error of approximation, $(y-\hat{y})$, for any particular choices of $k_i\in\kappa$, $N$, and $n$. This may be conveniently
accomplished by the following approach to the decomposition:

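Define the set of $2\pi$-periodic discrete functions, $x_i$, on the integers, $k$, as

$$x_i(2\pi k/N) = \begin{cases} 1, & -2\pi k_i/N < 2\pi k/N < 2\pi k_i/N \\ 0, & 2\pi k_i/N < 2\pi k/N < \pi - 2\pi k_i/N \\ -1, & \pi - 2\pi k_i/N < 2\pi k/N < \pi + 2\pi k_i/N \end{cases} \tag{1}$$

where $x_i(2\pi + 2\pi k/N) = x_i(2\pi k/N)$, $0 \le k \le N$.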

Let $y(\theta)=\sin\theta$ be approximated by a sum of these functions with amplitudes, $2^{a_i-n}$, a binary multiple
of the magnitude $2^{-n}$, where $a_i$ is an integer, $0 \le a_i \le n-1$, as

$$\hat{y}(k) = \sum_{i=1}^{p} 2^{a_i-n}\, x_i(2\pi k/N), \qquad 0 \le k \le N-1, \quad p \le 2^n-1. \tag{2}$$


Defining the Fourier transformation of these even functions in terms of the piecewise continuous extension†,
$x_i(\theta)$,

$$F_{\hat{y}(\theta)}(m) = \frac{1}{\pi}\int_0^{2\pi} \hat{y}(\theta)\cos m\theta\, d\theta, \qquad 0 \le m < \infty, \quad 0 \le \theta \le 2\pi, \tag{3}$$

or, from (2),

$$F_{\hat{y}(\theta)}(m) = \frac{1}{\pi}\int_0^{2\pi} \sum_{i=1}^{p} 2^{a_i-n}\, x_i(\theta)\cos m\theta\, d\theta. \tag{4}$$

† Since a relative comparison of the harmonic components of the error is desired, the continuous, rather than the
discrete version of the frequency functions is used here. Strictly speaking, the frequency functions more
appropriately might be computed on the unit circle since the functions involved are defined on the integers;
however, for the purpose of exposition the extension simplifies the description of the decomposition.

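Then, using the definition for $x_i$ in (1), we obtain the Fourier coefficients of $\hat{y}(k)$ evaluated at each harmonic,
$m$,

$$F_{\hat{y}(k)}(m) = \frac{1}{\pi}\sum_{k_i\in\kappa} 2^{a_i-n+2}\,\frac{\sin(2\pi m k_i/N)}{m}, \qquad m \text{ odd}; \tag{5}$$

and, for completeness,

$$\hat{y}(k) = \frac{1}{\pi}\sum_{k_i\in\kappa} 2^{a_i-n+2} \sum_{\substack{m=1 \\ m\ \mathrm{odd}}}^{\infty} \frac{\sin(2\pi m k_i/N)}{m}\cos(2\pi m k/N), \qquad 0 \le k \le N-1. \tag{6}$$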
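The closed form (5) can be checked numerically against the defining integral (3). The sketch below does this for a small, made-up staircase; the transition points `ks`, exponents `exps`, and the values of `n` and `N` in the comment are illustrative assumptions, not data from this paper, and the function names are hypothetical.

```python
import math


def x_i(theta, theta_i):
    """Three-level elementary function of equation (1), continuous extension."""
    t = theta % (2 * math.pi)
    if t < theta_i or t > 2 * math.pi - theta_i:
        return 1.0                      # plateau centred on 0
    if abs(t - math.pi) < theta_i:
        return -1.0                     # plateau centred on pi
    return 0.0


def harmonic_closed_form(m, ks, exps, n, N):
    """Equation (5): nonzero only for odd harmonics m."""
    if m % 2 == 0:
        return 0.0
    return sum(2.0 ** (a - n + 2) * math.sin(2 * math.pi * m * k / N) / m
               for k, a in zip(ks, exps)) / math.pi


def harmonic_numeric(m, ks, exps, n, N, steps=20000):
    """Equation (3) by the midpoint rule on [0, 2*pi]."""
    h = 2 * math.pi / steps
    total = 0.0
    for j in range(steps):
        theta = (j + 0.5) * h
        y_hat = sum(2.0 ** (a - n) * x_i(theta, 2 * math.pi * k / N)
                    for k, a in zip(ks, exps))
        total += y_hat * math.cos(m * theta) * h
    return total / math.pi

# Example (assumed values): ks = [3, 7, 12], exps = [0, 1, 0], n = 3, N = 64.
```

With these assumed inputs the two evaluations agree to within the quadrature error for the first few harmonics, and the even harmonics vanish, as (5) requires.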

Thus, the evaluation of the spurious harmonics of $(y-\hat{y})$ is obtained from (5) for $m \ne 1$ and odd. The set $\kappa$ is
conveniently conceived in terms of subsets, $\kappa_0, \kappa_1, \ldots, \kappa_p$, for computational purposes, where the amplitude
factors $2^{a_i}$ are $2^0, 2^1, \ldots, 2^p$, respectively. Then,

$$F_{\hat{y}(k)}(m) = \frac{2^{-n+2}}{\pi}\left[\sum_{k_i\in\kappa_0} 2^0\,\frac{\sin(2\pi m k_i/N)}{m} + \sum_{k_i\in\kappa_1} 2^1\,\frac{\sin(2\pi m k_i/N)}{m} + \cdots + \sum_{k_i\in\kappa_p} 2^p\,\frac{\sin(2\pi m k_i/N)}{m}\right] \tag{7}$$

where $\kappa_p$ is the set of integers at which the change in $\hat{y}(k_i) = 2^p$. These subsets of $\kappa$ are mutually exclusive
whenever the changes in $\hat{y}$ are strictly even. This is generally the case unless the encoding is highly refined, as
will be shown.

Determination of $k_i$

The set of points at which $\hat{y}(k+1)-\hat{y}(k)=\Delta\hat{y}\ne 0$, $k_i\in\kappa$, is determined by some algorithm for digitally encoding
$y(\theta)=\sin\theta$ into $L$ distinct binary levels on $N$ equally spaced intervals, $2\pi/N$. For illustration the case of encoding
$\hat{y}(k)=A\sin(2\pi k/N)$, $A\le 1$, over the interval $0\le k\le (N/4)-1$, $k$, $N$ integers, using $(n+1)$ bits rounded to $n$ bits as an
approximation has been considered.

Two regions of interest may be noted in connection with the generation of $\kappa$. For small values of $k$, $k \ll N/4$,
the change $y(k+1)-y(k)=\Delta y$ may substantially exceed $2^{-n}$, reducing the number of codes for the
members in $\kappa$ by such coding omissions. For large values of $k$, $k\approx N/4$, the change in $y$, $\Delta y$, may
not exceed $2^{-n}$, reducing the number of members in $\kappa$ by these omissions. In particular, if
$N=2^s$, codes will be omitted whenever $2^{s-n}<2\pi$ for $A=1$ and $k\ll N/4$; and, in the neighborhood of $N/4$, roughly
$\ell$ codes will be omitted for $k=(N/4)-\ell$, $\ell\ll N/4$, where $\ell=2^{2s-n-2}/\pi^2$. Thus, for $s=10$, $n=8$, and $A=1$

$$\begin{aligned}\kappa_0 &= (2,4,6,9,11,13,\ldots,226,230,234,239)\\ \kappa_1 &= (1,3,5,7,8,10,12,\ldots,109,113,118,124)\end{aligned} \tag{8}$$

totaling 201 distinct transitions, $\Delta\hat{y}\ne 0$, of the total possible 255. With a scaling change to $A=1-2^{-n+1}$ and
$s=10$, $n=8$, then $L=203$ and

$$\begin{aligned}\kappa_0 &= (2,4,6,8,11,13,\ldots,226,229,234,239,246)\\ \kappa_1 &= (1,3,5,7,9,10,\ldots,105,110,114,120,128)\end{aligned} \tag{9}$$

A comparison of interest is the case of $s=10$, $n=7$, $A=1$

$$\kappa_0 = (1,2,4,5,6,8,\ldots,213,218,223,232) \tag{10}$$

and for $s=10$, $n=7$, $A=1-2^{-n}$

$$\kappa_0 = (1,2,4,5,6,8,\ldots,213,218,224,231,242) \tag{11}$$


with all other sets empty and $L=127$ in the last two cases. Evaluation of the Fourier coefficients is particularly
simple in these cases since equation (7) reduces to

$$F_{\hat{y}(k)}(m) = \frac{2^{-n+2}}{\pi}\sum_{i=1}^{2^n-1}\frac{\sin(2\pi m k_i/N)}{m}, \qquad m \text{ odd}. \tag{12}$$


Many other encoding algorithms can be conveniently compared in terms of the harmonics directly calculated using
equation (7) or (12) and these subsets, $\kappa_p$, of $\kappa$. These harmonics are discussed in the context of an application
in the next section.
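The transition sets of this section can be generated with a few lines of code. The sketch below is one reading of the encoding, not a definitive implementation: it assumes round-to-nearest of $A\,2^n\sin(2\pi k/N)$, clipping at the largest $n$-bit magnitude $2^n-1$, and attribution of each transition to the index at which the new code first appears. The function name `transition_sets` is hypothetical.

```python
import math


def transition_sets(s, n, amplitude=1.0):
    """Group transition indices k by jump size (in units of 2**-n).

    Assumptions: N = 2**s samples per period, quarter-wave interval
    0 <= k <= N/4 - 1, round-to-nearest encoding, codes clipped at 2**n - 1.
    Returns a dict {jump_size: [k, ...]}; jump size 1 corresponds to
    kappa_0 and jump size 2 to kappa_1.
    """
    N = 2 ** s
    top = 2 ** n - 1

    def code(k):
        value = amplitude * (2 ** n) * math.sin(2 * math.pi * k / N)
        return min(top, math.floor(value + 0.5))

    sets = {}
    for k in range(1, N // 4):
        jump = code(k) - code(k - 1)
        if jump > 0:
            sets.setdefault(jump, []).append(k)
    return sets
```

Under these assumptions, `transition_sets(10, 8)` begins with jump-1 indices 2, 4, 6, 9, 11, 13 and jump-2 indices 1, 3, 5, 7, 8, 10, 12, matching the leading members of the sets in (8), and `transition_sets(10, 7)` begins 1, 2, 4, 5, 6, 8 as in (10). Other rounding or clipping conventions would shift individual members.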

An  Application 

The harmonic content of the error, $(y-\hat{y})$, has been calculated for the four cases illustrated by the sets given
as (8), (9), (10), and (11). The maximum spurious component in each case is presented in Table 1 where the
ratio of the spurious harmonic amplitude to the fundamental amplitude is expressed in decibels (dB).

TABLE 1

Bits of Magnitude    Case    Maximum Harmonics    Relative Amplitude, dB
        8            (8)          7, 9                 -61, -61
        8            (9)          3, 5                 -54, -64
        7            (10)         3, 5                 -57, -57
        7            (11)         3, 7                 -54, -61

Maximum Spurious Harmonics


More generally the behavior of the harmonics is as illustrated in Figure 1, which shows the bounds below
which the harmonics fall in cases (8) and (9) for 8 bits of magnitude encoded. Two observations can be made
from such data: 1) the effect of scaling the function to be encoded is primarily to reduce the lower harmonics,
and full-scale encoding is preferable for the cases considered; 2) there is a general decline in the relative
amplitudes of the harmonics which is not monotonic.

Further data experiments have shown that no harmonic in any of the four cases illustrated falls above the
maximum spurious harmonics given in Table 1. Scaling for $s=10$, $n=8$, and $A=1-2^{-n}$ yields $L=202$ and the
maximum spurious harmonics at 3 and 9, at -56 dB and -63 dB respectively.

Figure 2 shows the results for seven-bit magnitude-encoding for cases (10) and (11). The general
conclusions are the same as those for Figure 1; however, an overlay of the two figures for direct comparison indicates
the general superiority of the eight-bit case through the thirtieth harmonic with little benefit beyond, leaving the
conclusion that Table 1 contains the major distinctions for the four cases.

Data experiments were performed to test the sensitivity of the relative amplitude of the maximum harmonics to
the elimination of a constituent function, $x_i$, for both the case of $k_i$ small and $k_i$ mid-range. In case (8) the
maximum spurious components increased by eight to fifteen dB when a mid-range ($k_i=124$) component, $x_i$, was
eliminated, and the maximum spurious components increased by four to sixteen dB when a small ($k_i=2$)
component, $x_i$, was eliminated.

The application of these results may be appreciated in the context of the behavior of the discrete Fourier
transformation for various degrees of quantization and arithmetic complexity. In both the case of fixed-point
computational elements (ACE) and the case of block-floating-point computational elements (BCE) with seven and
eight bit magnitude encoding of trig-weights, the encoding of the sines and cosines limits performance above
certain Fourier transformer input word sizes. This conclusion can be established using the total mean-squared-
error (MSE) between ideal fully floating point computation and limited word-size computation with block floating
point (BF) arithmetic methods and with fixed-point arithmetic methods. The results(1) of such comparisons
typically behave as shown in Figure 3 for a 1024-point Fast Fourier Transformation (FFT). The asymptotic region shown
for input word sizes to the FFT over fifteen bits suggests that performance is limited not by input word-size or
arithmetic method but by the nature of the trigonometric function encoding. The speculation that the arithmetic
operations related to inner-product computation in this region are linear is borne out by the agreement between
experimental results with actual FFT signal processors and the results presented in Table 1 and Figures 1
and 2. Since for a single sine-wave input the output of the FFT is simply the harmonic content portrayed as that
associated with the encoded trigonometric weights, the verification of Table 1 is straightforward at the output of
the FFT as the convolution of an impulse function at the input frequency with a frequency-function having the
harmonics portrayed in Figures 1 and 2.

(1) Private communication with the permission of J.D. Kislinger.

SPIE Vol. 341 Real Time Signal Processing V (1982) / 25


FIGURE 1. SPURIOUS HARMONICS FOR EIGHT BIT MAGNITUDE ENCODING

FIGURE 2. SPURIOUS HARMONICS FOR SEVEN BIT MAGNITUDE ENCODING


Conclusions 


A simple method for directly calculating the harmonics of the error of binary encoding of periodic waveforms
is presented and illustrated for trigonometric functions. An application to the design of Fast Fourier
Transformers is suggested in quantitative terms permitting an evaluation of the spurious harmonics to be expected as a
function of the degree of encoding. The context in which these results have a bearing on machine complexity is
suggested.

Abstract

A  machine  architecture  for  computing  the  eigenvalues  and  eigenvectors  of  an  Hermitian 
matrix  is  presented.     Two  systolic  arrays  are  used,   one  for  reducing  full  matrices  to  band 
matrices,   the  second  for  performing  QR  iteration  on  band  matrices.     A  one-parameter  family 
of  systems,   parameterized  by  the  bandwidth  of  the  reduced  matrix,   is  available.  This 
allows  a  tradeoff  of  processors  for  execution  time. 

Introduction 

Several  recent,   powerful,   data-adaptive  methods  in  signal  processing  require  computing 
the  eigensystem  of  an  Hermitian  matrix;   for  example,   the  MUSIC  algorithm  for  multiple 
signal  classification1   and  several  recent  techniques    (the  "enhanced"  minimum-variance  and 
maximum-entropy  methods)    for  power  spectrum  estimation2 .     Specialized  computer  architectures 
making  effective  use  of  parallel  arrays  of  computing  cells,   implemented  as  VLSI  circuits, 
would  be  a  cost-effective  way  to  implement  these  methods.     A  general  family  of  such 
architectures,  called  "systolic  arrays"  by  Kung3,   has  to  date  proven  very  useful  for  many 
similar  problems  in  computational  linear  algebra.     In  fact,   for  systems  of  linear  equations, 
least squares problems, and matrix multiplication, a variety of usable designs are now
known4,5,6. But little work has been done on the eigenvalue problem.

QR  iteration 

The standard approach to the eigenvalue problem $Ax=\lambda x$ is to iteratively form a sequence
of matrices $A^{(n)}$, $n=0, 1, \ldots$, converging to the diagonal matrix of eigenvalues of $A$ and
each unitarily similar to $A$. We begin by setting

$$A^{(0)} = A.$$

Now, for $n = 0, 1, \ldots$, factor

$$A^{(n)} = Q^{(n)}R^{(n)} \tag{1a}$$

where $Q^{(n)}$ is unitary, $R^{(n)}$ is upper triangular, and in the case of complex matrices, the
diagonal elements of $R^{(n)}$ are real. Then form

$$A^{(n+1)} = R^{(n)}Q^{(n)} = Q^{(n)*}A^{(n)}Q^{(n)}. \tag{1b}$$

It can be shown that convergence is very rapid7.
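The factor-then-multiply iteration above can be sketched serially as follows. This is only an illustration under simplifying assumptions: the matrix is real symmetric rather than complex Hermitian, no shifts are used, and the factorization is built from plane rotations in the spirit of the arrays discussed later. The function names are hypothetical.

```python
import math


def qr_by_rotations(a):
    """Return (q, r) with a = q r and r upper triangular, via plane rotations."""
    n = len(a)
    r = [row[:] for row in a]
    q = [[float(i == j) for j in range(n)] for i in range(n)]
    for col in range(n):
        for row in range(col + 1, n):
            x, y = r[col][col], r[row][col]
            rho = math.hypot(x, y)
            if rho == 0.0:
                continue
            c, s = x / rho, y / rho
            for j in range(n):
                # rotate rows `col` and `row` of r, zeroing r[row][col]
                r[col][j], r[row][j] = (c * r[col][j] + s * r[row][j],
                                        -s * r[col][j] + c * r[row][j])
                # accumulate the transposed rotation into the columns of q
                q[j][col], q[j][row] = (c * q[j][col] + s * q[j][row],
                                        -s * q[j][col] + c * q[j][row])
    return q, r


def qr_iteration(a, iterations):
    """Repeat: factor A = QR, then replace A by RQ."""
    n = len(a)
    for _ in range(iterations):
        q, r = qr_by_rotations(a)
        a = [[sum(r[i][l] * q[l][j] for l in range(n)) for j in range(n)]
             for i in range(n)]
    return a
```

For the symmetric tridiagonal test matrix [[2, 1, 0], [1, 3, 1], [0, 1, 4]], whose eigenvalues are $3-\sqrt{3}$, $3$ and $3+\sqrt{3}$, the iterates approach a diagonal matrix carrying those values.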

On ordinary machines, each iteration would require $O(N^3)$ operations. To quicken the
algorithm, the matrix $A$ is first reduced to tridiagonal form: thus, the first iterate is
defined by

$$A^{(0)} = P^*AP \tag{2}$$

where $P$ is a unitary matrix chosen to make $A^{(0)}$ tridiagonal. This tridiagonal reduction
takes $O(N^3)$ operations, but it need be done just once. The QR iteration preserves
tridiagonal form, and applied to tridiagonal matrices, QR iteration needs only $O(N)$
operations. Another advantage is that approximate eigenvalues are easily available from the
tridiagonal iterates and can be used, in a shifting strategy, to further speed the
convergence.7

We  are  led,   therefore,   to  consider  systolic  array  designs  for  the  tridiagonalization  (2) 
of  a  dense  Hermitian  matrix.     It  appears  that  all  methods  for  accomplishing  the  reduction 
proceed  by  applying  a  sequence  of  simple  unitary  operators    (rotations  or  reflections)  from 
the  left  to  zero  a  column  of  A,   then  applying  the  adjoint  of  these  operators  from  the  right, 
which  both  maintains  unitary  similarity  to  A  and  zeros  the  corresponding  row.     A  systolic 
array  of  N  processors  can  be  effectively  used  to  implement  this  method.     But  no 
design  using  more  than  N  processors    (and  using  a  two-dimensional  interconnection  pattern) 
has been developed. Because these efforts have been frustrated, we next ask if, by only
reducing $A$ to a banded structure, we can use more processors and less time. The answer is
this: a systolic array can reduce a dense matrix to a unitarily similar one of bandwidth
$k$ using $kN - k(k-1)/2$ processors in time $N^2/k$. We give the details of the method and the
systolic array in the next section.

Next, consider arrays for the QR iteration (1). An array for the factorization (1a) of
a banded matrix was described by Heller and Ipsen.5 We shall show that this array can be
used for both the factorization and the multiplication (1b). Both operations occur
simultaneously, using half the processors of the array each, and taking very little more
time than the factorization alone.

We  shall  assume  throughout  that  the  matrices  discussed  are  complex  and,   unless  stated 
otherwise,   N  by  N. 

Reduction  to  banded  form 

We say that a matrix $B$ has bandwidth $k$ if $|i-j|>k$ implies that $b_{ij} = 0$. Such a matrix
has at most $2k+1$ nonzero diagonals. The systolic array of Figure 1 is used to accomplish
the reduction of a dense matrix to one of bandwidth $k$. It implements the matrix
factorization

$$X = Q\begin{bmatrix}\begin{matrix}R\\0\end{matrix} & S\end{bmatrix} \tag{3}$$

where $X$ is a given $p$ by $q$ matrix, $Q$ is unitary, $R$ is a $k$ by $k$ upper-triangular matrix with
real diagonal elements, and $S$ is $p$ by $q-k$. Alternatively, the array can multiply a given
input matrix $X$ on the left by the adjoint of a given unitary matrix $Q$, forming

$$Y = Q^*X. \tag{4}$$

The unitary matrices, generated in the factorization (3) and used in the multiplication (4),
are represented as a product of plane rotations of the following type. A typical such
rotation replaces the ith and jth rows of a matrix, $r_i$ and $r_j$, by the linear combinations

$$\begin{bmatrix} c^* & \sigma \\ -\sigma & c\end{bmatrix}\begin{bmatrix} r_i \\ r_j\end{bmatrix}.$$

Figure 2 defines the actions of the cells of the array. The circular "boundary" cells
compute the parameters ($c$ and $\sigma$) of a plane rotation that zeros one of the elements of the
input matrix. The square "internal" cells apply given rotations, which flow from the
boundary cells through the array, to pairs of elements of the input matrix. Figure 3
illustrates the flow of data through the array. To carry out the factorization (3), the
input matrix $X$ enters through the top of the array, a sequence of pairs $(c,\sigma)$ that defines
$Q$ emerges at the right edge, and the elements of $S$ and $R$ come out the bottom. This process
we shall call a "forward" pass. The multiplication (4) is done by the same array.
The $(c,\sigma)$ pairs defining $Q$ enter at the boundary cells, which do not change them. $X$ again
enters from the top, and the product $Q^*X$ leaves from the bottom. This is called a
"backward" pass.

The array of Figure 1 is identical to a subarray of an array for QR factorization of
rectangular matrices, originally developed by Gentleman and Kung.6 A detailed description
of the use of this design for solving least squares problems will appear in a later paper.8
There it will be shown that a single VLSI cell performing the real operation $c \leftarrow c + ab$
can be used to fabricate the internal cells for a complex plane rotation, implementation
of the boundary cells will be discussed, and it will be shown that the required sequence
of forward and backward passes can be processed by the array with only a single cycle's
delay between them.

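We now show how the reduction to banded form is done. Assume that $J \equiv N/k$ is an integer.
The entire reduction is obtained by a sequence of $J-1$ forward and backward passes. Start
with

$$A_1 = A.$$

Then, for $j = 1, 2, \ldots, J-1$, partition

$$A_j = \begin{bmatrix} B_{jj} & C_j^* \\ C_j & D_j\end{bmatrix}.$$

Using the array, find $Q_j$, $R_j$, and $E_j$ such that

$$\begin{bmatrix} C_j & D_j\end{bmatrix} = Q_j\begin{bmatrix}\begin{matrix}R_j\\0\end{matrix} & E_j\end{bmatrix}$$

where $R_j$ is as above. This is the jth forward pass. Next, let

$$A_{j+1} = Q_j^*E_j^*.$$

This is the jth backward pass. Letting

$$\tilde{Q}_j = \begin{bmatrix} I_k & 0 \\ 0 & Q_j\end{bmatrix}$$

where $I_k$ is the identity matrix of order $k$, we have that

$$\tilde{Q}_j^*A_j\tilde{Q}_j = \begin{bmatrix} B_{jj} & C_j^* \\ \begin{matrix}R_j\\0\end{matrix} & E_j\end{bmatrix}\tilde{Q}_j = \begin{bmatrix} B_{jj} & \begin{matrix}R_j^* & 0\end{matrix} \\ \begin{matrix}R_j\\0\end{matrix} & A_{j+1}\end{bmatrix}.$$

Finally, let $B_{JJ} = A_J$. Then

$$\begin{bmatrix} B_{11} & R_1^* & & \\ R_1 & B_{22} & \ddots & \\ & \ddots & \ddots & R_{J-1}^* \\ & & R_{J-1} & B_{JJ}\end{bmatrix}$$

is the desired banded matrix.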
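To see the effect of such a reduction without the array, the following serial sketch reduces a real symmetric matrix to bandwidth $k$ by plane rotations applied from the left and their transposes from the right. It is a simplified stand-in, not the block forward/backward pass scheme above: it assumes real data, zeros one element at a time, and uses the hypothetical name `reduce_to_band`.

```python
import math


def reduce_to_band(a, k):
    """Return an orthogonally similar matrix with b[i][j] = 0 for |i-j| > k."""
    n = len(a)
    b = [row[:] for row in a]
    for col in range(n - k - 1):
        # zero column `col` below row col+k, working upward from the bottom
        for row in range(n - 1, col + k, -1):
            x, y = b[row - 1][col], b[row][col]
            rho = math.hypot(x, y)
            if rho == 0.0:
                continue
            c, s = x / rho, y / rho
            for j in range(n):   # rotation from the left (rows row-1, row)
                b[row - 1][j], b[row][j] = (c * b[row - 1][j] + s * b[row][j],
                                            -s * b[row - 1][j] + c * b[row][j])
            for i in range(n):   # its transpose from the right (columns)
                b[i][row - 1], b[i][row] = (c * b[i][row - 1] + s * b[i][row],
                                            -s * b[i][row - 1] + c * b[i][row])
    return b
```

Because every step is a similarity transformation, the trace and the Frobenius norm of the input are preserved, which gives a quick check on a small example.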


An  array  for  band  QR  iteration 


The systolic array for banded QR factorization of Heller and Ipsen5 can be used to
carry out both the factorization step (1a) and the multiplication step (1b) of QR
iteration. Figure 4 shows an example of the array for the case $k=2$. The figure shows the manner
in which elements of a matrix $A$ to be factored enter from the bottom, the elements of the
factor $R$ emerge from the top, and the plane rotations defining $Q$ leave at the right edge.
Notice that elements enter and exit from a given cell only every other cycle; half the
cells are idle at any given time. Memoryless versions of the cells of Figure 2 are used.

To obtain the product $RQ$ we use the array to compute its transpose, $Q^*R^*$. To do this we
insert $R$ at the bottom and the rotations defining $Q$ at the left. In this situation the
boundary cells apply the given rotation to their inputs just as the internal cells do.

The two computations can be done in parallel on the same array. The idea is to use, for
the RQ pass, the processors not in use by the QR pass. Figure 5 illustrates this process.
It shows a sequence of cycles during which elements of $R$ are produced, leave the array at
the top, and reenter from the bottom. At the same time, the rotations leaving the right
edge immediately reenter at the left. To accomplish this, additional "wrap-around"
connections, and multiplexors to select the alternating sources and destinations for the data,
are needed.


In Table 1 we give a schedule for the array showing the times and places at which $A$ and
$R$ enter and leave the array.

Table 1. Systolic Array Schedule for QR Iteration

Time       Event                                                                        Location*
0          $a_{11}$ enters                                                              1,0
k          $a_{k1}$ enters, rotation (k,1) created**                                    1,k
2k         $r_{11}$ leaves                                                              k,k
2k+1       $r_{11}$ reenters                                                            1,0
3k+1       $r_{1k}$ reenters; rotation (k,1) reenters                                   1,k
4k+1       $a_{11}$ (an element of the new iterate) leaves                              k,k
2N-2       $a_{NN}$ enters                                                              1,0
2N         $a_{11}$ may enter to start the next iteration (provided that N > 2k+1)†     1,0
2N+4k-1    $a_{NN}$ leaves; end of the iteration.                                       k,k

Notes:

*   locations are shown in Figure 4.

**  rotation (i,j) is used to zero element (i,j) of A.

†   when N < 2k+1 the data becomes available only at time 4k+2; this imposes a delay
    between iterations and so idles part of the array.

This scheme makes use of the fact that, while $R$ has $2k$ nonzero diagonals, only the first
$k+1$ need be used when forming $RQ$. Only the lower-triangular part of $RQ$ is correctly
computed; the rest can be inferred from its symmetry. The $k$ extra "fill-in" diagonals of
$R$ (those that lie outside the band of $A$) have no effect on this part of $RQ$ and so may be
ignored.

We want to use the same cell types as were used in the reduction array. In particular,
we hope to work with plane rotations involving a real quantity ($\sigma$) and not two complex
quantities. The boundary cells can generate such rotations provided their inputs coming
from below are always real. This will always be the case, because the reduction
to banded form yields a matrix with real elements on the main, kth-lower and kth-upper
diagonals and, moreover, the QR iteration preserves this property.

Design  Tradeoffs 

Choice of the bandwidth $k$ enables a tradeoff of processors for execution time. It is not
a linear tradeoff, however. The number of processors used is given, for $k < N/2$, by

$$P = kN + (3/2)k^2 + O(k).$$

The total time, assuming $I$ iterations of QR will be done, is

$$T = 2NI + N^2/k,$$

since QR iterations each take $2N$ cycles and the jth pass of the reduction phase takes
$2(N-jk)$ cycles. Thus, for $k < N/2$, the time-processor product is

$$PT = NI(3k^2 + 2kN) + N^3 + (3/2)N^2k.$$

Certainly  PT  is  minimized  for  k  =  1.     But  when  k  <<  N  then  PT  is  largely  insensitive  to  k, 
so  it  becomes  quite  reasonable  and  inexpensive  to  reduce  the  overall  time  by  increasing  k. 
This  is  particularly  true  when  I  is  small  compared  with  N,  as  is  the  case  when  not  all  of 
the  eigenvalues  are  to  be  computed.     The  alternative  extreme,  where  k  =  N-l,   in  other  words 
when  no  reduction  is  done  and  we  iterate  on  the  full  matrix,  yields  a  total  time  of  4NI 
(see  Table  1)    and  uses  N2  processors.     Some  preliminary  reduction,  at  least  to  k  =  N/2, 
decreases  both  the  time  required  and  the  number  of  processors  needed. 
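The tradeoff can be tabulated directly from the expressions for $P$ and $T$. The sketch below assumes $k$ divides $N$ and drops the $O(k)$ term in $P$; the function names are hypothetical. It also sums the per-pass reduction cost $2(N-jk)$ to show where the $N^2/k$ term comes from.

```python
def reduction_cycles(N, k):
    """Exact sum of 2(N - jk) over the J-1 reduction passes, J = N/k."""
    J = N // k
    return sum(2 * (N - j * k) for j in range(1, J))


def processors(N, k):
    """Leading terms of the processor count (O(k) term omitted)."""
    return k * N + 1.5 * k * k


def total_time(N, k, iterations):
    """Cycles for `iterations` QR iterations plus the reduction phase."""
    return 2 * N * iterations + N * N / k


def tradeoff_table(N, iterations, bandwidths):
    """Rows of (k, P, T, P*T) for the requested bandwidths."""
    rows = []
    for k in bandwidths:
        p, t = processors(N, k), total_time(N, k, iterations)
        rows.append((k, p, t, p * t))
    return rows
```

The exact reduction sum equals $N^2/k - N$, so $N^2/k$ is its leading term. Calling, say, `tradeoff_table(64, 10, [1, 2, 4, 8])` shows the time falling and the processor count rising as $k$ grows.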
Figure 1. Systolic array for unitary reduction to banded form.

Figure 2. Cells for complex plane rotation.

Figure 3. Matrix operations realized by the array.

Figure 4. Systolic array for band QR factorization.

Figure 5. Simultaneous use of band QR array for iteration.

Abstract

Linear time computation of the singular value decomposition (SVD) would be useful in
many real time signal processing applications. Two algorithms for the SVD have been
developed for implementation on a quadratic array of processors. A specific architecture is
proposed and we demonstrate the mapping of the algorithms to the architecture. The
algorithms and architecture together have been verified by functional level and register
transfer level simulation.

Introduction 

The advent of VLSI technology has spurred research in massively parallel high speed
processors. One of the results of this research is the systolic1 or wavefront array2
processor. This type of processor has been applied to a variety of highly structured problems,
including matrix-vector multiplication1, matrix-matrix multiplication1, LU decomposition3,
QR decomposition3, $LL^T$ (Choleski) decomposition2, digital filtering4, convolution4, and
correlation4. One significant omission from this list of applications is the singular value
decomposition (SVD). We will present two new SVD algorithms and their associated array
architecture.

Notation 

The following notation is used throughout this paper:
$a_i$ represents the ith row of the matrix $A$.
$\langle a_i, a_j\rangle$ represents the inner product of vectors $a_i$ and $a_j$.
$Q$ is an orthogonal matrix of the appropriate dimension, i.e., $Q^TQ = I$.
$((\cdot))_n$ denotes modulo $n$, and
$\lfloor\cdot\rfloor$ is the floor function, i.e. the greatest integer less than or equal to the argument.

The  singular  value  decomposition 

The SVD is a factorization of an $m \times n$ matrix ($m \ge n$) into two orthogonal matrices and a
diagonal matrix, i.e.

$$A = U\Sigma V^T$$

where $A\in R^{m\times n}$, $U\in R^{m\times m}$, $\Sigma\in R^{m\times n}$, $V\in R^{n\times n}$, $U^TU=I$, $V^TV=I$, $\Sigma=\mathrm{diag}(\sigma_i)$,
$\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_r>0$, $\sigma_{r+1}=\cdots=\sigma_n=0$.
We call $U$ a matrix of left singular vectors and $V$ a matrix of right singular vectors.

The SVD is closely related to the eigenvalue decomposition of a symmetric matrix. The
singular values are the square roots of the eigenvalues of $AA^T$ or $A^TA$ and the left (right)
singular vectors are the eigenvectors of $AA^T$ ($A^TA$), viz.

$$AA^T = (U\Sigma V^T)(U\Sigma V^T)^T = U\Sigma\Sigma^TU^T,$$

and

$$A^TA = (U\Sigma V^T)^T(U\Sigma V^T) = V\Sigma^T\Sigma V^T.$$

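The SVD also corresponds to the eigenvalue decomposition of the matrix
$\begin{bmatrix}0 & A^T\\ A & 0\end{bmatrix}$.
We let $U = [U_1, U_2]$, with $U_1\in R^{m\times n}$ and $U_2\in R^{m\times(m-n)}$, and

$$Q = \frac{1}{\sqrt{2}}\begin{bmatrix} V & V & 0 \\ U_1 & -U_1 & \sqrt{2}\,U_2\end{bmatrix};$$

then

$$Q^T\begin{bmatrix}0 & A^T\\ A & 0\end{bmatrix}Q = \mathrm{diag}(\sigma_1,\ldots,\sigma_n,-\sigma_1,\ldots,-\sigma_n,0,\ldots,0).$$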

In practice the SVD should not be computed by the eigenvalue decomposition. The explicit
formation of $A^T A$ is a numerically poor technique since elements of $O(\sqrt{\epsilon})$ of the matrix $A$,
where $\epsilon$ is the machine precision, may be lost in the formation of $A^T A$ [6]. A doubling of data
width would be required to compensate for this deficiency. The eigenvalue decomposition of
$\begin{bmatrix} 0 & A^T \\ A & 0 \end{bmatrix}$ should not be used because of the doubling of the problem size.
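The loss of information in forming $A^T A$ can be seen on a small example. The following sketch assumes IEEE double precision arithmetic; the 3 x 2 test matrix, the value `t`, and the helper name `gram` are illustrative choices and do not come from the references.

```python
def gram(a):
    """Form A^T A explicitly for a matrix stored as a list of rows."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]

# t is close to the square root of the double precision machine epsilon.
t = 1.0e-8
a = [[1.0, 1.0],
     [t,   0.0],
     [0.0, t]]

# In exact arithmetic A^T A = [[1 + t^2, 1], [1, 1 + t^2]], with eigenvalues
# 2 + t^2 and t^2, so the singular values of A are sqrt(2 + t^2) and t.
g = gram(a)

# In floating point, 1 + t^2 rounds to 1, so the computed Gram matrix is
# exactly singular and the singular value t can no longer be recovered.
det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
print(g, det)
```

A method that works on the rows or columns of $A$ directly never forms the sum $1 + t^2$ and so does not suffer this particular loss.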


Another useful characterization of the SVD, used in image processing, is as a weighted
sum of outer products of singular vectors:

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$$

where $U = [u_1, u_2, \ldots, u_m]$ and $V = [v_1, v_2, \ldots, v_n]$.
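A small numerical check of the outer product form may be helpful. The sketch below builds a 2 x 2 matrix from assumed singular vectors (two plane rotations) and assumed singular values 3.0 and 0.5, then compares it with the full and truncated outer product sums; all names and values are illustrative.

```python
import math

def rotation(t):
    """2x2 rotation matrix; its columns are orthonormal."""
    return [[math.cos(t), -math.sin(t)],
            [math.sin(t),  math.cos(t)]]

def column(m, k):
    return [row[k] for row in m]

def weighted_outer_sum(U, V, sigma, r):
    """Sum of the first r terms sigma_k * u_k * v_k^T."""
    rows, cols = len(U), len(V)
    total = [[0.0] * cols for _ in range(rows)]
    for k in range(r):
        u, v = column(U, k), column(V, k)
        for i in range(rows):
            for j in range(cols):
                total[i][j] += sigma[k] * u[i] * v[j]
    return total

def frobenius(x, y):
    """Frobenius norm of the difference x - y."""
    return math.sqrt(sum((a - b) ** 2
                         for rx, ry in zip(x, y) for a, b in zip(rx, ry)))

U, V, sigma = rotation(0.3), rotation(1.1), [3.0, 0.5]

# A = U diag(sigma) V^T, written out element by element.
A = [[sum(U[i][k] * sigma[k] * V[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]

full_error = frobenius(A, weighted_outer_sum(U, V, sigma, 2))
rank1_error = frobenius(A, weighted_outer_sum(U, V, sigma, 1))
print(full_error, rank1_error)
```

Dropping the last term leaves an error whose Frobenius norm equals the discarded singular value, which is the property exploited in data compression.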


The SVD is a powerful mathematical tool that is potentially useful in many signal processing
applications. Speiser and Whitehouse [7] have shown that the SVD could be used in
adaptive filtering and data compression. Andrews and Patterson [8] demonstrated the use of
the SVD in image enhancement and restoration. Some other applications available from the
SVD are:

• Orthogonal Procrustes Problems in Factor Analysis [6,9], i.e. find an orthogonal matrix $Q$ such that $\|A - BQ\|_F$ = minimum, where $\|\cdot\|_F$ is the Frobenius norm of a matrix.

• Subspace Determination [6,10] (null space, principal angles, intersections). We note that the SVD is a more appropriate and powerful technique for a null space computation than the procedure used by Kung [2].

• Rank Estimation [9]

• Best approximation of a given matrix by a matrix of a fixed lower rank [9]

• Pseudo Inverse Computation [9]

• Least Squares Solutions [9]

• Canonical Correlation Computations [10]

A linear time SVD processor would make many previously intractable real time signal
processing problems tractable. However, the SVD has not been used to date in real
time applications because the SVD algorithms are computationally intensive and do not
exhibit parallelism.

Our algorithms do not change the problem size, avoid the explicit formation of $A^T A$, and
are designed for parallel processing. Assuming $m \approx n$, computation of the SVD requires an
iterative cubic algorithm, i.e. on the order of $n^3$ operations. Therefore, a linear time
algorithm requires at least a quadratic number of processors.

The  quadratic  array  of  processors 

Advances in VLSI have made it feasible to implement systems of thousands of processing
elements (PE's). The quadratic array of processors is a two dimensional array of PE's
subject to the capabilities and constraints of VLSI technology. Inherent VLSI PE array
issues include limited interprocessor communication, design complexity, and programmability.


Design  complexity  can  be  managed  by  requiring  a  system  to  be  composed  of  many  instances 
of  a  single  processor  replicated  in  a  regular  structure.     This  requirement,   in  turn,  alters 
the  general  issue  of  programmability  to  the  problem  of  mapping  highly  structured  parallel 
algorithms  onto  the  processor  array. 

The planar nature and limited number of conducting levels of VLSI circuitry mandate a
planar interconnection graph. Subject to the three constraints described below it turns
out that there are only three possible interconnection graphs. The constraints are that
the shape of a processor is bounded by a regular convex polygon, that the polygons
tessellate the plane, and that the interconnection graph is the dual of the graph formed by
tessellating the plane. Actually, there are only two interconnection structures because one
of the structures is simply a subdivision of another. The two remaining structures are the
orthogonal and hexagonal nearest neighbor meshes shown in Figure 1. Both structures lack
global communication, which is a severe restriction for the computation of the SVD.

FIGURE 1     Orthogonal Nearest Neighbor mesh and Hexagonal Nearest Neighbor mesh

SVD  algorithms 

There are several sequential algorithms for the computation of the SVD, among them the
Golub and Reinsch EISPACK algorithm [12], various Lanczos algorithms [13], and similarity
transforms for an equivalent eigenproblem. None of these methods seems to be amenable to
quadratic array processor implementation. The EISPACK algorithm requires global communication.
The Lanczos algorithm is a two step method and requires complicated control for the
occasional reorthogonalization. Similarity transform methods (two sided transforms) are
inherently serialized by the data dependencies of row and column operations.

The SVD algorithms we have derived are variants of a method due to Hestenes [14]. This
method was originally investigated because it is a one sided transform, i.e. it requires
only premultiplication or postmultiplication but not both. In addition, Luk [15] used
Hestenes' method on the Illiac IV, which has an orthogonal nearest neighbor communication
network. Luk's implementation required only limited interprocessor communication.


Hestenes' method is an iterative cubic algorithm that forces pairwise row or column
orthogonality by orthogonal plane rotations (Jacobi rotations). A simplified version of
the algorithm is shown in Figure 2. Hestenes' method implicitly computes the eigenvalue
decomposition of $AA^T$ without actually forming $AA^T$.


Let $Q = (q_{kl})$ with

$$q_{ii} = q_{jj} = \cos\theta, \qquad q_{ij} = -q_{ji} = \sin\theta, \qquad q_{kk} = 1 \ (k \ne i, j), \qquad q_{kl} = 0 \ \text{otherwise},$$

i.e., a rotation in the $(i,j)$ plane, where $\theta$ is chosen such that $[(Q^T A)(Q^T A)^T]_{ij} = 0$.
If successive iterates are formed as

$$A_{p+1} = Q_p^T A_p$$

then

$$A_{p+1} A_{p+1}^T = Q_p^T A_p A_p^T Q_p ,$$

hence $A_p A_p^T$ is an orthogonal similarity transform of $AA^T$ and

$$\lim_{p \to \infty} A_p = \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \end{bmatrix} = \Sigma V^T \quad \text{where } \sigma_i = \|w_i\|_2 .$$

Given a matrix $A \in R^{m \times n}$ this algorithm computes orthogonal matrices $Q_1, Q_2, \ldots$ such that
if $U = Q_1 Q_2 \cdots$ then $U^T A = W$ where $W = \Sigma V^T$, i.e. $\sigma_i = \|w_i\|_2$.

    p_i ← <a_i, a_i>    i = 1, ..., n            /* compute (A^T A)_ii */
    do sweep = 0, 1, ...                          /* iterate until convergence */
      do i = 1, ..., n-1
        do j = i+1, ..., n                        /* for all PE's */
          α ← 2<a_i, a_j>                         /* compute 2(A^T A)_ij */
          if α = 0 then skip                      /* skip if orthogonal */
          else
            β ← p_i - p_j                         /* compute rotation */
            δ ← (α^2 + β^2)^(1/2)
            if β > 0 then
              c ← ((δ + β)/(2δ))^(1/2)
              s ← α/(2δc)
            else
              s ← ((δ - β)/(2δ))^(1/2)
              c ← α/(2δs)
            a_i ← c a_i + s a_j                   /* apply rotation to */
            a_j ← -s a_i + c a_j                  /* rows i and j (old values on the right) */
            p_i ← c^2 p_i + s^2 p_j + (c)(s)α     /* update diagonal elements */
            p_j ← s^2 p_i + c^2 p_j - (c)(s)α     /* (old values on the right) */
        end
      end
    end

FIGURE 2     HESTENES' METHOD
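For readers who wish to experiment, Figure 2 can be transcribed into a short sequential program. The following Python sketch is one such transcription under stated assumptions: the matrix is stored as a list of rows, the rows are the vectors being orthogonalized, and the exact test "α = 0" is replaced by a relative tolerance `tol` with a sweep limit `max_sweeps`. The names `dot`, `rotation`, and `hestenes`, and both parameters, are illustrative and are not part of the original algorithm statement.

```python
import math

def dot(x, y):
    """Inner product <x, y> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def rotation(alpha, beta):
    """Cosine and sine as in Figure 2, for alpha = 2<a_i,a_j> != 0 and
    beta = p_i - p_j."""
    delta = math.hypot(alpha, beta)
    if beta > 0:
        c = math.sqrt((delta + beta) / (2.0 * delta))
        s = alpha / (2.0 * delta * c)
    else:
        s = math.sqrt((delta - beta) / (2.0 * delta))
        c = alpha / (2.0 * delta * s)
    return c, s

def hestenes(a, tol=1e-12, max_sweeps=30):
    """Orthogonalize the rows of a by plane rotations.

    Returns (w, p): the rotated rows and their squared norms, so that the
    singular values are the square roots of the entries of p.
    """
    w = [list(row) for row in a]
    n = len(w)
    p = [dot(row, row) for row in w]
    scale = sum(p)  # squared Frobenius norm, unchanged by the rotations
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = 2.0 * dot(w[i], w[j])
                if abs(alpha) <= tol * scale:   # assumed tolerance test
                    continue
                rotated = True
                c, s = rotation(alpha, p[i] - p[j])
                w[i], w[j] = ([c * x + s * y for x, y in zip(w[i], w[j])],
                              [-s * x + c * y for x, y in zip(w[i], w[j])])
                p[i], p[j] = (c * c * p[i] + s * s * p[j] + c * s * alpha,
                              s * s * p[i] + c * c * p[j] - c * s * alpha)
        if not rotated:
            break
    return w, p

w, p = hestenes([[3.0, 0.0], [4.0, 5.0]])
print([math.sqrt(x) for x in p])
```

For the 2 x 2 example a single rotation suffices, and the squared row norms become 45 and 5, the eigenvalues of $AA^T$.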


Hestenes' method is numerically stable and is ultimately quadratically convergent. The
method automatically produces one set of singular vectors. The other set may be determined
by accumulating the plane rotations, by using the defining relation $A = U \Sigma V^T$, or by
computing the SVD of the transpose of the original matrix. Hestenes' method was widely used on
sequential computers before the advent of the Golub and Reinsch EISPACK algorithm. The
former method was then superseded because it requires three times as many floating-point
operations as the latter procedure. However, this factor of three is a small price for us
to pay here, for we are able to develop parallel algorithms based on Hestenes' idea.

Hestenes' method, as shown in Figure 2, requires the computation of square roots.
Wilkinson has shown there exist square-root-free approximations to the rotation angles that
will not eliminate the method's ultimate quadratic convergence.

Architecture 


Our proposed architecture is shown in Figure 3. It consists of a memory, a memory
interface, a multiplexor, and a triangular array of identical PE's. The PE's are strictly
orthogonally interconnected by two unidirectional data paths and a unidirectional control ring
in each direction. Data initially enters the PE array via the multiplexor from the memory
interface. The data flows to the right until it reaches a PE on the diagonal. The PE on
the diagonal reflects the data such that it flows downward to eventually exit from the
bottom of the array. Data from the bottom of the array immediately reenters the array via
the multiplexor until the algorithm terminates. The PE at (i,j) processes data from row i
and row j because of the downward reflection of data from row j. The memory interface
schedules data to and from memory such that the data is initially skewed in time. That is,
the first element of row i is read from memory at time i. The initial skew of data is such
that at PE(i,j) the column numbers of the data from rows i and j are always the same. Thus,
a PE is supplied with the correct data to compute inner products and plane rotations.
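The effect of the initial skew can be checked with a unit-delay model of the data flow. The sketch below rests on modeling assumptions that are not taken from the hardware description: each hop between neighboring PE's takes one time step, reflection at a diagonal PE costs no extra step, PE(I,J) with J < I lies in array row I and array column J, and element c of data row r leaves memory at time r + c. The function name `arrival_times` is illustrative.

```python
def arrival_times(n, cols):
    """Record, for every PE, which data column arrives at which time.

    horizontal[(I, J)] maps time -> column index of the row-I element moving
    right through PE(I, J); vertical[(I, J)] does the same for the row-J
    element moving down after its reflection at PE(J, J).
    """
    horizontal, vertical = {}, {}
    for r in range(n):                  # data row r enters array row r
        for c in range(cols):
            t = r + c                   # skewed read time of element c
            for J in range(r + 1):      # move right up to the diagonal PE
                horizontal.setdefault((r, J), {})[t + J] = c
            t_diag = t + r              # arrival time at PE(r, r)
            for I in range(r + 1, n):   # reflected data moves down column r
                vertical.setdefault((I, r), {})[t_diag + (I - r)] = c
    return horizontal, vertical

n, cols = 5, 4
horizontal, vertical = arrival_times(n, cols)

# At every off-diagonal PE the two streams carry the same column index
# at the same time step.
aligned = all(horizontal[(I, J)] == vertical[(I, J)]
              for I in range(n) for J in range(I))
print(aligned)
```

Under these assumptions both streams reach PE(I,J) with column c at time I + J + c, which is the alignment the text describes.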

A diagram of one PE is shown in Figure 4. Each PE has four data registers and two
control registers that serve as stable platforms for information from its above and left
neighbors. Each PE also has a few words of working storage, an ALU, a microprogram memory,
a current state register, and state change logic. A PE cycles through a small number of
functional states. For instance, in one method a PE iterates on a three state sequence
of inner product, plane rotation, and convergence test. Each data element has an associated
one bit tag; data and tag move concomitantly through the array. These data tags are the
primary sequencing mechanism for the distributed control scheme. Global control functions
such as "reset" are conveyed by the control ring. Each PE computes its next state based on
the data tags, current state, and control register values.

Figures 3 and 4 are pedagogical in nature. An implementation of our architecture in a
current technology would probably use a byte, nibble, or bit serial PE to maximize
integration. In addition, the communication paths would be time multiplexed and of small width
to reduce area and balance communication time with computation time.


FIGURE 3     Proposed Architecture

FIGURE 4     Processing Element


New  algorithms 

Hestenes' method cannot be implemented on the proposed architecture without
modification. His method computes a plane rotation to zero out $(A^T A)_{ij}$, and this
computation requires the elements $(A^T A)_{ij}$, $(A^T A)_{ii}$, and $(A^T A)_{jj}$, i.e. $\langle a_i, a_j \rangle$, $\langle a_i, a_i \rangle$, and
$\langle a_j, a_j \rangle$. Application of the rotation changes $a_i$ and $a_j$. Thus any other inner products
involving $a_i$ must wait for the completion of the inner products $\langle a_i, a_j \rangle$ and $\langle a_i, a_i \rangle$ and
the start of the rotation. This serialization of inner products hinders a linear time
computation.

Linear time computation requires the parallel computation of $\langle a_i, a_i \rangle$, $\langle a_j, a_j \rangle$, and
$\langle a_i, a_j \rangle$ for all i and j. The diagonal elements, $\langle a_i, a_i \rangle$ and $\langle a_j, a_j \rangle$, can be updated in
unit time from their previous values, the rotation angle, and $\langle a_i, a_j \rangle$. In order to realize
this computation on the array, the diagonal elements must be passed around the array so
that the latest values are immediately available. This is done with the second of the two
data paths in each direction. Thus the only remaining problem is the parallel computation
of $\langle a_i, a_k \rangle$, $k = i+1, \ldots, n$.

We have developed two algorithms, Methods I and II, that avoid the serialization implicit
in Hestenes' method. Both algorithms compute inner products of data, some or all of which
may be up to one sweep older than the equivalent computation by Hestenes' method.
Essentially, Methods I and II compute approximations to the angle used in a Jacobi rotation.

Method I, shown in Figure 5, can best be explained by analogy to Hestenes' method. Method
I is similar to Hestenes' method except that all inner products are computed first and then
all rotations are determined and performed. The initiation of computations by the
distributed control scheme can be described by the wavefront concept described by S.Y. Kung [2].
Computation of an inner product or application of a rotation is initiated by one wavefront
and terminated by the next, where the wavefronts are separated by n time steps. The
wavefronts originate in the upper left corner of the array and progress one unit in each
direction each time step. This corresponds to wavefronts generated by a point source of
period n with a box metric, i.e.,

$$d_I[(0,0),(i,j)] = |i| + |j| .$$

The data tags represent an embodiment of these wavefronts if the tags associated with the
first column of data are on and all other tags are off.

    p_i ← <a_i, a_i>    i = 1, ..., n            /* compute (A^T A)_ii for all i */
    do sweep = 0, 1, ...
      do i = 1, ..., n-1
        do j = i+1, ..., n
          α_ij ← 2<a_i, a_j>                      /* compute 2(A^T A)_ij for all i, j */
        end
      end
      do i = 1, ..., n-1
        do j = i+1, ..., n                        /* for all PE's */
          if α_ij = 0 then skip                   /* skip if already zero */
          else
            β ← p_i - p_j                         /* compute rotation based on */
            δ ← (α_ij^2 + β^2)^(1/2)              /* precomputed α */
            if β > 0 then
              c ← ((δ + β)/(2δ))^(1/2)
              s ← α_ij/(2δc)
            else
              s ← ((δ - β)/(2δ))^(1/2)
              c ← α_ij/(2δs)
            a_i ← c a_i + s a_j                   /* apply rotation to rows i */
            a_j ← -s a_i + c a_j                  /* and j (old values on the right) */
            p_i ← c^2 p_i + s^2 p_j + (c)(s)α_ij  /* update diagonal elements */
            p_j ← s^2 p_i + c^2 p_j - (c)(s)α_ij  /* (old values on the right) */
        end
      end
    end

FIGURE 5     METHOD I
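The difference from Hestenes' method is easiest to see in code: every inner product is taken from the data as it stood at the start of the sweep, and only then are the rotations applied. The sketch below transcribes one sweep of Figure 5 under the same assumptions as a row-oriented sequential program (list-of-rows storage, rows as the vectors being rotated). The names `dot`, `rotation`, and `method1_sweep` are illustrative, and the example makes no claim about convergence beyond the empirical results reported in this paper.

```python
import math

def dot(x, y):
    """Inner product <x, y> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def rotation(alpha, beta):
    """Cosine and sine as in Figure 5, for alpha != 0 and beta = p_i - p_j."""
    delta = math.hypot(alpha, beta)
    if beta > 0:
        c = math.sqrt((delta + beta) / (2.0 * delta))
        s = alpha / (2.0 * delta * c)
    else:
        s = math.sqrt((delta - beta) / (2.0 * delta))
        c = alpha / (2.0 * delta * s)
    return c, s

def method1_sweep(a, p):
    """One sweep of Method I on copies of the rows a and squared norms p."""
    w = [list(row) for row in a]
    p = list(p)
    n = len(w)
    # Phase 1: all inner products, computed before any rotation is applied.
    alpha = {(i, j): 2.0 * dot(w[i], w[j])
             for i in range(n - 1) for j in range(i + 1, n)}
    # Phase 2: all rotations, each using its precomputed (possibly stale) alpha.
    for i in range(n - 1):
        for j in range(i + 1, n):
            a_ij = alpha[(i, j)]
            if a_ij == 0.0:
                continue
            c, s = rotation(a_ij, p[i] - p[j])
            w[i], w[j] = ([c * x + s * y for x, y in zip(w[i], w[j])],
                          [-s * x + c * y for x, y in zip(w[i], w[j])])
            p[i], p[j] = (c * c * p[i] + s * s * p[j] + c * s * a_ij,
                          s * s * p[i] + c * c * p[j] - c * s * a_ij)
    return w, p

rows = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]
w3, p3 = method1_sweep(rows, [dot(r, r) for r in rows])
print(sum(dot(r, r) for r in w3))
```

Because $c^2 + s^2 = 1$ for any values of α and β, each step remains an orthogonal rotation even when α is out of date, so the sum of squared row norms is preserved; only the angle is approximate.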


Method II, shown in Figure 6, is a more complex approximation than Method I. Method II
will be explained geometrically by defining the computation of each processor at an
instant in time based on the wavefronts and distance metric.


The wavefronts corresponding to Method II are generated by a periodic point source at
the origin (i.e. upper left hand corner) in the metric

$$d_{II}[(0,0),(I,J)] = I + 2J$$

where I increases downward, starting from zero, and J increases to the right, also starting
from zero. An example, where n=8, is shown in Figure 7. These wavefronts are equivalent
to the tags of diagonal data elements being on and all others off. The wavefronts move two
down to every one across. In fact, the tags denote leading and trailing wavefronts due to
the downward reflection of the data by PE's on the diagonal. The leading wavefronts are
shown in solid lines and the trailing wavefronts are shown in dashed lines in Figure 7.
The trailing wavefronts are not required by the distributed control scheme and are ignored.

Because wavefronts initiate state transitions they partition the array into sets of PE's
performing different operations. After a startup transient the PE's execute a state
sequence of inner product, plane rotation, diagonal element recomputation, and convergence
test. To differentiate between states we define an index for each PE as the difference
between the time and distance from the origin,

$$k = \mathrm{Clock} - d_{II}[(0,0),(I,J)] ,$$

where Clock is the number of unit time steps from the beginning of the simulation. The
index $k$ increases uniformly with time. We decompose $k$ into a quotient and remainder when
divided by $n$:

$$k = n \lfloor k/n \rfloor + ((k))_n .$$

The quotient assigns successive integers to the wavefronts and the remainder identifies the
n individual steps within a state. A wavefront is equivalent to $((k))_n = 0$. The startup
transient occurs when $k$ is less than zero, i.e. the first wavefront has yet to arrive. The
quotient modulo four, $((\lfloor k/n \rfloor))_4$, successively denotes the four states a PE iterates through
while in steady state. The algorithm definition, Figure 6, shows how these indices are
used.
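The index arithmetic above is compact enough to state as a function. The following Python sketch maps a clock value and a PE position to the state named in the text; the function name `pe_phase`, the tuple `STATES`, and the returned triple are illustrative conventions rather than part of the algorithm definition.

```python
STATES = ("inner product", "plane rotation",
          "diagonal recomputation", "convergence test")

def pe_phase(clock, i, j, n):
    """Return None during the startup transient, otherwise a triple
    (state name, step within the state, True if on a wavefront)."""
    k = clock - (i + 2 * j)        # time minus distance d_II from the origin
    if k < 0:
        return None                # first wavefront has not yet arrived
    quotient, step = divmod(k, n)  # k = n * floor(k/n) + ((k))_n
    return STATES[quotient % 4], step, step == 0

# With n = 8, PE(2,3) is at distance 2 + 2*3 = 8 from the origin.
print(pe_phase(5, 2, 3, 8))
print(pe_phase(8, 2, 3, 8))
print(pe_phase(30, 1, 1, 8))
```

A PE that is farther from the origin in the $d_{II}$ metric simply runs the same four state cycle with a fixed delay, which is what allows the control to be distributed.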


    p_i ← <a_i, a_i>    i = 1, ..., n                        /* compute (A^T A)_ii */
    do sweep = 0, 1, ...                                      /* iterate to convergence */
      do clock = 0, 1, ..., 4n-1                              /* 4n clocks per sweep */
        do i = 0, ..., n-1
          do j = 0, ..., i                                    /* for all PE's */
            k ← 4n(sweep) + clock - (i+2j)                    /* compute index */
            if k < 0 then skip                                /* skip before first wavefront */
            else
              j' ← ((j+k))_n                                  /* compute data column index */
              if j' = 0 then j' ← n
              if ((⌊k/n⌋))_4 = 0 then                         /* state = inner product */
                if ((k))_n = 0 then IP_ij ← a_ij' a_jj'       /* initialize on wavefront */
                else IP_ij ← IP_ij + a_ij' a_jj'              /* accumulate inner product */
              else
              if ((⌊k/n⌋))_4 = 1 then                         /* state = rotate */
                if ((k))_n = 0 then                           /* on wavefront compute c, s, */
                  Compute c, s                                /* p_i, p_j */
                  Update p_i, p_j                             /* as in Hestenes' method */
                else skip
                a_ij' ← c a_ij' + s a_jj'                     /* rotate two data elements */
                a_jj' ← -s a_ij' + c a_jj'                    /* (old values on the right) */
              else
              if ((⌊k/n⌋))_4 = 2 then                         /* state = recompute diagonal */
                if ((k))_n = 0 then IP_ij ← a_ij' a_jj'       /* same as inner product above */
                else IP_ij ← IP_ij + a_ij' a_jj'
              else                                            /* state = convergence test */
                if ((k))_n = 0 and i = j then
                  p_i ← IP_ij                                 /* save (A^T A)_ii */
                Test for convergence and pass result          /* IP_ij < tolerance => converged */
          end
        end
      end
    end

FIGURE 6     METHOD II

FIGURE 7     N=8 Processing Element Array


Results 


Numerical experiments have been performed to verify that the algorithms and architecture
correctly compute the SVD. The reference SVD was obtained from the EISPACK algorithm, from a
sequential version of Hestenes' method, and from theoretical results. Functional level
simulation code for both Methods I and II was written and tested. Simplified versions of this
code form the basis for the algorithms' description in the previous section. Unit delay
register transfer level simulation programs were written for the architecture with both
algorithms.

Register transfer level simulation is possible because the system consists of many
instances of only three distinct components. Each distinct component has been implemented as
a separate subroutine or set of nested subroutines. This approach ensures a hierarchical
separation of function. Time separation is accomplished by context switch between old and
new system states on each clock pulse.
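The switch between old and new system states can be illustrated on a much smaller system than the array itself. The sketch below simulates a chain of registers in which each cell copies its left neighbor on every clock pulse; it is an illustrative toy, not the simulation code described above, and the names `step` and `step_in_place` are assumptions.

```python
def step(old_state, input_value):
    """Advance a chain of registers by one clock pulse.

    Every cell reads only the OLD state of its left neighbor, so the
    result does not depend on the order in which cells are visited.
    """
    new_state = [None] * len(old_state)
    for idx in range(len(old_state)):
        new_state[idx] = input_value if idx == 0 else old_state[idx - 1]
    return new_state

def step_in_place(state, input_value):
    """Incorrect variant: updating in place lets a value ripple through
    the whole chain within a single clock pulse."""
    for idx in range(len(state)):
        state[idx] = input_value if idx == 0 else state[idx - 1]
    return state

state = [0, 0, 0, 0]
history = []
for value in [1, 2, 3]:
    state = step(state, value)   # new state becomes old state on the next pulse
    history.append(state)
print(history)
print(step_in_place([0, 0, 0, 0], 1))
```

Keeping two copies of the state is what gives every component a unit delay regardless of the order in which the simulator calls the component subroutines.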

Test matrices from Gregory and Karney [17], Wilkinson [16], and matrices of random numbers
have been chosen. These tests have included both rank deficient and full rank matrices,
matrices of distinct and multiple singular values, and matrices of order up to twenty-three.
In all cases both algorithms converged to the SVD with errors of the order of the machine
precision. This evidence of convergence is empirical. We have not yet developed a formal
proof of convergence.

Conclusions 


We have investigated linear time computation of the SVD. A linear time solution requires
a quadratic number of processors because the SVD is of cubic complexity. Feasibility
constraints on the implementation of a quadratic number of processors have led us to the choice
of a systolic array architecture. This architecture features identical PE's, a distributed
control scheme, and the potential for asynchronous timing.

One limitation of systolic arrays is the critical algorithm-architecture interaction.
Data dependencies or global communication requirements in computation of the SVD have
prevented the mapping of classical algorithms to our architecture. By adopting a less
efficient algorithm as a basis we have developed two new algorithms that can be computed on a
systolic array. Thus we can now solve real time problems that require the SVD such as
image enhancement and data compression. We can also implement systems that require real
time subspace or null space computations, pseudo inverse computations, or least squares
solutions.

Acknowledgment 

This  research  was  supported  in  part  by  the  University  Research  Program  of  the  General 
Electric  Company.     The  work  of  the  second  author  was  also  supported  in  part  by  the  U.S. 
Army  Research  Office  under  grant  DAAG  29-79-C0124. 

References 

1. Kung, H.T. and Leiserson, C.E., "Systolic Arrays (for VLSI)", Sparse Matrix Proceedings 1978, Duff, I.S. and Stewart, G.W., eds., SIAM 1979, pp. 256-282.

2. Kung, S.Y., "VLSI Array Processor for Signal Processing", Proc. Conf. on Advanced Research in Integrated Circuits, M.I.T., Cambridge, MA, Jan. 28-30, 1980.

3. Gentleman, W.M. and Kung, H.T., "Matrix Triangularization by Systolic Arrays", Proc. SPIE Symp., Vol. 298, Real Time Signal Processing IV, SPIE 1981.

4. Kung, H.T., "Why Systolic Architectures", Computer, Vol. 15, No. 1, 1982, pp. 37-46.

5. Lanczos, C., Linear Differential Operators, Van Nostrand Co., London, 1961.

6. Golub, G.H. and Van Loan, C.F., Advanced Matrix Computations, to be published, Johns Hopkins Press.

7. Speiser, J.M. and Whitehouse, H.J., "Architectures for Real Time Matrix Operations", Proc. 1980 Government Microcircuit Applications Conf., Houston, TX, Nov. 19-21, 1980.

8. Andrews, H.C. and Patterson, C.L., "Singular Value Decomposition and Digital Image Processing", IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-24, 1976, pp. 26-53.

9. Golub, G.H. and Luk, F.T., "Singular Value Decomposition: Applications and Computations", ARO Report 77-1, Trans. of 22nd Conf. of Army Mathematicians, 1977, pp. 577-605.

10. Bjorck, A. and Golub, G.H., "Numerical Methods for Computing Angles between Linear Subspaces", Math. Comp., Vol. 27, 1973, pp. 579-594.

11. Heller, D.E. and Ipsen, I.C.F., "Systolic Networks for Orthogonal Decompositions with Applications", Tech. Report CS-81-18, Computer Science Dept., Penn. State Univ., University Park, PA, 1981.

12. Golub, G.H. and Reinsch, C., "Singular Value Decomposition and Least Squares Solutions", Numer. Math., Vol. 14, 1970, pp. 403-420.

13. Golub, G.H., Luk, F.T. and Overton, M.L., "A Block Lanczos Method for Computing the Singular Values and Corresponding Singular Vectors of a Matrix", ACM Trans. Math. Software, Vol. 7, No. 2, 1981, pp. 149-169.

14. Hestenes, M.R., "Inversion of Matrices by Biorthogonalization and Related Results", Journal SIAM, Vol. 6, 1958, pp. 51-90.

15. Luk, F.T., "Computing the Singular Value Decomposition on the Illiac IV", ACM Trans. Math. Software, Vol. 6, 1980, pp. 524-539.

16. Wilkinson, J.H., The Algebraic Eigenvalue Problem, 1965, Clarendon Press, Oxford.

17. Gregory, R.T. and Karney, D.L., A Collection of Matrices for Testing Computational Algorithms, 1969, John Wiley, New York.


SPIE Vol. 341 Real Time Signal Processing V (1982) / 43


Synchronizing  large  systolic  arrays 


Allan  L.  Fisher,  H.  T.  Kung 

Department  of  Computer  Science,  Carnegie-Mellon  University 
Pittsburgh,  Pennsylvania  15213 


Abstract 

Parallel  computing  structures  consist  of  many  processors  operating  simultaneously.  If  a  concurrent  structure  is  regular,  as  in  the 
case  of  a  systolic  array,  it  may  be  convenient  to  think  of  all  processors  as  operating  in  lock  step.  This  synchronized  view,  for 
example,  often  makes  the  definition  of  the  structure  and  its  correctness  relatively  easy  to  follow.  However,  large,  totally 
synchronized  systems  controlled  by  central  clocks  are  difficult  to  implement  because  of  the  inevitable  problem  of  clock  skews  and 
delays.  An  alternative  means  of  enforcing  necessary  synchronization  is  the  use  of  self-timed,  asynchronous  schemes,  at  the  cost  of 
increased  design  complexity  and  hardware  cost.  Realizing  that  different  circumstances  call  for  different  synchronization  methods, 
this  paper  provides  a  spectrum  of  synchronization  models:  based  on  the  assumptions  made  for  each  model,  theoretical  lower 
bounds  on  clock  skew  are  derived,  and  appropriate  or  best-possible  synchronization  schemes  for  systolic  arrays  are  proposed.  In 
general,  this  paper  represents  a  first  step  towards  a  systematic  study  of  synchronization  problems  for  large  systolic  arrays. 

One  set  of  models  is  based  on  assumptions  that  allow  the  use  of  a  pipelined  clocking  scheme,  where  more  than  one  clock  event 
is  propagated  at  a  time.  In  this  case,  it  is  shown  that  even  assuming  that  physical  variations  along  clock  lines  can  produce  skews 
between  wires  of  the  same  length,  any  one-dimensional  systolic  array  can  be  correctly  synchronized  by  a  global  pipelined  clock 
while  enjoying  desirable  properties  such  as  modularity,  expandability  and  robustness  in  the  synchronization  scheme.  This  result 
cannot  be  extended  to  two-dimensional  arrays,  however — the  paper  shows  that  under  this  assumption,  it  is  impossible  to  run  a 
clock  such  that  the  maximum  clock  skew  between  two  communicating  cells  will  be  bounded  by  a  constant  as  systems  grow.  For 
such  cases  or  where  pipelined  clocking  is  unworkable,  a  synchronization  scheme  incorporating  both  clocked  and  "asynchronous" 
elements  is  proposed. 

1.  Introduction 

Parallel  computing  structures  consist  of  many  processors,  or  cells  in  the  terminology  of  this  paper,  operating  simultaneously.  If 
a concurrent structure is regular, as in the case of a systolic array, it may be convenient to think of all cells as operating in lock step.
This  synchronized  view,  for  example,  often  makes  the  definition  of  the  structure  and  its  correctness  relatively  easy  to 
follow — indeed,  synchronized,  moving  transparencies  are  typically  used  in  talks  to  illustrate  systolic  arrays.  Perhaps  the  simplest 
means  of  synchronizing  an  ensemble  of  cells  is  the  use  of  broadcast  clocks.  A  clocked  system  in  general  consists  of  a  collection  of 
functional  units  whose  communication  is  synchronized  by  external  clock  signals.  A  variety  of  clocking  schemes  are  possible;  the 
essential  point  is  that  by  referring  to  the  global  time  standard  represented  by  the  clock,  communicating  cells  can  agree  on  when  a 
cell's  outputs  should  be  held  constant  and  when  a  cell  should  be  sensitive  to  its  input  wires.  When  different  cells  receive  clock 
signals  by  different  paths,  they  may  not  receive  clocking  events  at  the  same  time,  potentially  causing  synchronization  failure.  These 
synchronization  errors  due  to  clock  skews  can  be  avoided  by  lowering  clock  rates  and/or  adding  delay  to  circuits,  thereby  slowing 
the  computation.  The  usual  clocking  schemes  are  also  limited  in  performance  by  the  time  needed  to  drive  clock  lines,  which  will 
grow  as  circuit  feature  size  shrinks  relative  to  total  circuit  size.  Therefore,  unless  operating  at  possibly  unacceptable  speeds,  very 
large  systems  controlled  by  global  clocks  are  difficult  to  implement  because  of  the  inevitable  problem  of  clock  skews  and  delays. 

An alternative approach is self-timing, in which cells synchronize their communication locally with some variety of
"handshaking"  protocols.  It  is  easy  to  convince  oneself  that  any  synchronized  parallel  system  where  processors  operate  in  lock  step 
can  be  converted  into  a  corresponding  asynchronous  system  of  this  type  that  computes  the  same  output — the  asynchronous  system 
is  obtained  by  simply  letting  each  processor  start  computing  as  soon  as  its  inputs  become  available  from  other  processors.  The 
self-timed,  asynchronous  scheme  can  be  costly  in  terms  of  extra  hardware  and  delay  in  each  cell,  but  it  has  the  advantage  that  the 
time  required  for  a  communication  event  between  two  cells  is  independent  of  the  size  of  the  entire  processor  array.  A  serious 
disadvantage  of  fully  self-timed  systems  is  that  they  are  difficult  and  expensive  to  design  and  test. 

An  advantage  that  self-timed  systems  often  enjoy,  in  addition  to  the  absence  of  clock  skew  problems,  is  a  performance 
advantage  that  results  from  each  cell  being  able  to  start  computing  as  soon  as  its  inputs  are  ready  and  to  make  its  outputs  available 
as  soon  as  it  is  finished  computing.  This  allows  a  machine  to  take  advantage  of  variations  in  component  speed  or  data-dependent 
conditions allowing faster computation. This advantage will seldom exist in systolic systems, however, for two reasons:

• Usually, each cell in a systolic array performs the same kind of computation as every other cell; thus there is little
opportunity for speed variation.

• In cases where variations do exist, the throughput of computation along a path in an array is limited by the slowest
computation on that path. The probability that a worst-case computation will appear on a path with $k$ cells is $1 - p^k$,
where $p$ is the probability that any given cell will not be performing a worst-case computation. This quantity
approaches unity as $k$ grows, so large arrays will usually be forced to operate at worst-case speeds.
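
As a rough numerical illustration of the second point, the sketch below evaluates $1 - p^k$ for a few path lengths. The function name and the value p = 0.99 are assumptions chosen only for illustration, and the formula treats the cells on a path as independent.

```python
def worst_case_probability(p: float, k: int) -> float:
    """Probability that at least one of k independent cells on a path is
    performing a worst-case computation, when each cell avoids the worst
    case with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - p ** k


if __name__ == "__main__":
    p = 0.99  # assumed value, for illustration only
    for k in (1, 10, 100, 1000):
        print(k, round(worst_case_probability(p, k), 4))
```

Even when a single cell hits its worst case only rarely, the probability for a whole path climbs quickly with the path length.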

The  result  of  these  considerations  is  that  clocking  is  generally  preferable  to  self-timing  in  the  synchronization  of  systolic  arrays. 
The  techniques  described  below  use  clock-based  approaches,  sometimes  with  a  self-timed  assist,  to  allow  convenient 
synchronization  of  large  arrays. 

2.  Basic  Assumptions 

The  basic  model  that  we  will  use  for  considering  synchronization  of  systolic  arrays  is  as  follows: 

(A1) Inter-cell data communications in an ideally synchronized systolic array, in which all processors operate in lock step, are
defined by a directed graph COMM, which is laid out in the plane. Each node of COMM, also called a cell, represents
a cell of the systolic array, and each directed edge of COMM, called a communication edge, represents a wire capable of
sending a data item from the source cell to the target cell in every cycle of the system. Any two cells connected by a
communication edge are called communicating cells.

(A2) A cell occupies unit area.

(A3)  A  communication  edge  has  unit  width. 

We  now  add  assumptions  which  provide  the  basis  for  clocked  implementations  of  ideally  synchronized  arrays. 

(A4)  A  clock  for  a  clocked  systolic  array  is  distributed  by  a  rooted  binary  tree  CLK,  which  is  also  laid  out  in  the  plane.  A  cell 
of  COMM  can  be  clocked  if  the  cell  is  also  a  node  of  CLK. 

(A5) A clocked system may be driven with clock period $\delta + \Delta + t$, where $\delta$ is the maximum clock skew between any two
communicating cells, $\Delta$ is the maximum time for a cell's outputs to be computed and propagated, and $t$ is the time to
distribute a clocking event on CLK.

This  assumption  can  be  justified  by  appeal  to  a  more  detailed  model  which  deals  with  the  periods  of  time  in  which  cells  hold 
their  output  edges  invariant  or  are  sensitive  to  the  values  on  their  input  edges.  The  constraints  between  clock  events,  which  are 
enforced  in  implementation  by  the  pattern  of  the  clock  signals  and  circuit  delays,  may  be  adjusted  so  that  any  communicating  pair 
is properly synchronized with a clock period $\delta + \Delta + t$. Induction on the size of an array then shows that the clocked system
correctly  implements  the  ideally  synchronized  array. 

Note  that  if  we  adopt  the  usual  convention  that  the  clock  tree  is  brought  to  an  equipotential  state  before  a  new  clock  event  is 
transmitted,  eliminating  clock  skew  can  lead  only  to  a  constant  factor  increase  in  performance,  since  it  must  always  be  true  that 
$\delta < t$. In particular, speed-of-light considerations impose the following condition:

(A6) The time $t$ required to distribute a clocking event on a clock tree CLK in a particular layout is bounded below by $\alpha \cdot P$,
where $\alpha > 0$ is a constant and $P$ is the (physical) length of a longest root-to-leaf path in CLK.

Thus,  since  the  clock  tree  must  reach  each  cell  in  the  array,  large  arrays  which  are  synchronized  by  equipotential  clocking  must 
have  clock  periods  at  least  proportional  to  their  layouts'  diameters.  Note  that  in  the  remainder  of  this  paper,  we  will  relate 
transmission  delays  to  wire  length;  delays  are  caused  by  other  factors,  of  course,  but  we  choose  to  treat  them  together  as  a 
"distance"  metric. 

In the case where an array grows too big for its clock tree to be driven at the desired speeds due to the time needed to bring long
wires to an equipotential state, it is possible to take advantage of the propagation delay down a long wire by having several clock
cycles in progress along its length. (The authors were told that this "pipelined" form of clocking was actually implemented in some
high-speed CDC machines.) The electrical problems of passing a clean signal in this fashion are severe, due to analog
phenomena such as damping and reflections. We can instead simulate this behavior by replacing long wires with strings of buffers,
which will restore signal levels and prevent backward noise propagation. These buffers are spaced a constant distance apart; a good
candidate is that distance which will cause wire delays between buffers to be of the same size as a buffer's propagation delay. This
allows us to replace assumption (A6) with the following:

(A7) If CLK is a buffered clock tree, the time $t$ required to distribute a clocking event on a particular unbuffered segment of
CLK is the maximum delay through a buffer and its output wire. Thus, $t$ is a constant independent of the size of the
array.
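
The contrast between (A6) and (A7) can be sketched numerically. The functions below are a minimal illustration, not part of the paper: the names, the per-unit-length wire delay `alpha`, the buffer spacing and the buffer delay are all assumed parameters, and the buffered latency is an upper bound that treats every segment as full length.

```python
import math


def unbuffered_distribution_time(alpha: float, path_length: float) -> float:
    """(A6): lower bound alpha * P on the time to distribute one clock
    event along an equipotential path of physical length P."""
    return alpha * path_length


def buffered_clock(alpha: float, path_length: float,
                   spacing: float, buffer_delay: float):
    """(A7): a path broken into segments by buffers placed `spacing` apart.

    Returns (t, latency, segments), where t is the delay through one buffer
    and its output wire, latency is the end-to-end travel time of one event,
    and segments is also the number of events that can be in flight at once.
    """
    segments = max(1, math.ceil(path_length / spacing))
    t = buffer_delay + alpha * spacing  # per-segment time, independent of P
    return t, segments * t, segments


if __name__ == "__main__":
    for length in (100.0, 1000.0):
        # buffer delay chosen equal to the wire delay between buffers
        print(length,
              unbuffered_distribution_time(0.5, length),
              buffered_clock(0.5, length, spacing=4.0, buffer_delay=2.0))
```

In this sketch the per-segment time stays fixed as the path grows, while the end-to-end latency and the number of events in flight grow with the path length.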

To  ensure  that  successive  clock  events  remain  correctly  spaced  along  the  clock  path,  we  make  the  following  assumption: 

(A8)  The  time  for  a  signal  to  travel  on  a  particular  path  through  a  buffered  clock  tree  is  invariant  over  time. 

The  following  section  describes  two  models  based  on  the  above  assumptions,  and  Sections  4  and  5  explore  the  problem  of 
clocking  under  these  models.  Section  6  considers  the  case  where  assumption  (A8)  does  not  hold,  and  hence  condition  (A6)  holds 
rather  than  condition  (A7). 

3.  Two  Models  of  Clock  Skew 

Given a basic model consisting of conditions (A1) through (A5), plus (A7) and (A8), the following sections consider the
implications  of  two  models  of  clock  skew.  First,  in  Section  4  we  consider  the  case  where  clock  skew  between  two  cells  depends  on 
the  difference  in  their  physical  distance  from  the  root  of  the  clock  tree.  This  difference  model  corresponds  reasonably  well  with  the 
practical  situation  in  high  speed  systems  made  of  discrete  components,  where  clock  trees  are  often  wired  so  that  delay  from  the  root 
is  the  same  for  all  cells.  More  formally,  we  assume  the  following: 

(A9) The clock skew between two nodes of CLK, with respect to a given layout, is bounded above by $f(d)$, where $f$ is some
monotonically increasing function and $d$ is the positive difference between the (physical) lengths of the paths on CLK
that connect the two nodes to the root.

This  assumption  is  illustrated  in  Figure  1.  The  two  circles  connected  by  the  dashed  line  have  clock  skew  between  them  which  is 
no  more  than  a  constant  times  the  length  of  the  crosshatched  segment.  This  segment  represents  the  difference  between  the  cells' 
distances  to  their  nearest  common  ancestor  in  the  clock  tree. 


Figure  1:  Skew  in  the  difference  model. 

As  systems  grow,  small  variations  in  electrical  characteristics  along  clock  lines  can  build  up  unpredictably  to  produce  skews  even 
between  wires  of  the  same  length.  In  the  worst  case,  two  wires  can  have  propagation  delays  which  differ  in  proportion  to  the  sum 
of their lengths. Especially since it is not possible to tune the clock network of a system on a single chip, Section 5 considers a
model  in  which  the  skew  between  two  nodes  depends  on  the  distance  between  them  along  the  clock  tree.  Formally,  the  summation 
model  (so  called  because  the  distance  between  two  nodes  is  the  sum  of  their  distances  from  their  nearest  common  ancestor,  while 
the  difference  measure  used  above  is  the  difference  between  those  distances)  uses  the  following  upper  and  lower  bound 
assumptions: 

(A10) The clock skew between two nodes of CLK, with respect to a given layout, is bounded above by $g(s)$, where $g$ is
some monotonically increasing function and $s$ is the (physical) length of the path on CLK that connects the two nodes.

(A11) The clock skew between two nodes of CLK, with respect to a given layout, is bounded below by $\beta \cdot s$, where $\beta > 0$
is some constant and $s$ is the (physical) length of the path on CLK that connects the two nodes.

Figure  2  illustrates  these  assumptions;  here  both  the  upper  and  lower  bounds  on  the  skew  between  the  two  communicating  cells 
depend  on  the  entire  length  of  the  path  between  them,  which  is  the  sum  of  their  distances  to  their  nearest  common  ancestor  in  the 
tree. 
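
To make the two distance measures concrete, the following sketch computes the difference $d$ and the sum $s$ for a pair of nodes on a small clock tree. The tree representation (a dictionary mapping each node to its parent and the wire length to that parent) and the function names are assumptions made for this example only.

```python
def path_to_root(node, parent):
    """Return [(ancestor, distance from node)] walking from node to the root.

    `parent` maps each node to (parent_node, wire_length), or to None for
    the root of the clock tree.
    """
    steps = [(node, 0.0)]
    dist = 0.0
    while parent[node] is not None:
        node, length = parent[node]
        dist += length
        steps.append((node, dist))
    return steps


def difference_and_sum(a, b, parent):
    """Return (d, s) for nodes a and b: the difference and the sum of their
    distances to their nearest common ancestor in the clock tree."""
    dist_a = dict(path_to_root(a, parent))
    for ancestor, h_b in path_to_root(b, parent):
        if ancestor in dist_a:  # first hit is the nearest common ancestor
            h_a = dist_a[ancestor]
            return abs(h_a - h_b), h_a + h_b
    raise ValueError("nodes are not in the same tree")


if __name__ == "__main__":
    tree = {"r": None, "x": ("r", 2.0), "y": ("r", 2.0),
            "a": ("x", 1.0), "b": ("x", 1.0), "c": ("y", 3.0)}
    print(difference_and_sum("a", "b", tree))
    print(difference_and_sum("b", "c", tree))
```

The difference model bounds skew by a function of the first value, the summation model by a function of the second.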

Figure 2: Skew in the summation model.

The two models of clock skew introduced above can be formally derived as follows, for the case when both functions $f$ and $g$ are
linear. Let $h_1$ and $h_2$, with $h_1 \ge h_2$, be the distances of any two cells to their nearest common ancestor in the clock tree. Let $m + \epsilon$
and $m - \epsilon$ be the maximum and minimum time, respectively, to transmit a clock signal across a wire of unit length, where $\epsilon$
corresponds to the variations in electrical characteristics along clock lines. Then the clock skew between the two cells can be as
large as

$$\text{clock skew} = h_1 (m + \epsilon) - h_2 (m - \epsilon) = (h_1 - h_2)\, m + (h_1 + h_2)\, \epsilon.$$

Noticing that $d = h_1 - h_2$, $s = h_1 + h_2$, and $s \ge d \ge 0$, we have

$$(m + \epsilon) \cdot s \ge \text{clock skew} = m \cdot d + \epsilon \cdot s \ge \epsilon \cdot s.$$

We see that the upper and lower bounds correspond directly to assumptions (A10) and (A11) used in the summation model,
whereas the difference model corresponds to the case when terms involving $\epsilon$ can be ignored.
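
The derivation can be checked numerically with a short sketch. The function names and the sample values of the per-unit delay and its variation are assumptions for illustration; the formulas are the linear case derived in the text.

```python
def worst_case_skew(h1: float, h2: float, m: float, eps: float) -> float:
    """Largest skew between two cells at distances h1 and h2 from their
    nearest common ancestor: the longer path runs at the slowest rate
    (m + eps) and the shorter path at the fastest rate (m - eps)."""
    if h1 < h2:
        h1, h2 = h2, h1
    return h1 * (m + eps) - h2 * (m - eps)


def summation_bounds(h1: float, h2: float, m: float, eps: float):
    """Lower and upper bounds eps * s and (m + eps) * s on the worst-case
    skew, where s = h1 + h2 is the path length between the two cells."""
    s = h1 + h2
    return eps * s, (m + eps) * s


if __name__ == "__main__":
    for h1, h2 in ((5.0, 3.0), (4.0, 4.0)):
        skew = worst_case_skew(h1, h2, m=1.0, eps=0.25)
        low, high = summation_bounds(h1, h2, m=1.0, eps=0.25)
        print(h1, h2, low, skew, high)
```

The second pair has equal path lengths, so the difference term vanishes, yet the worst-case skew is still nonzero and equal to the lower bound. This is the situation the summation model is meant to capture.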

4.  Clocking  under  the  Difference  Model 

Assuming the basic model defined above along with condition (A9), which states that the skew between two cells is bounded by
a function of the difference between their distances from the root, it is apparent that no clock skew will occur if we ensure that all
nodes in COMM are equidistant (with respect to the clock layout) from the root of CLK. This can be achieved for any layout for
COMM of bounded aspect ratio, without increasing the area of the layout by more than a small constant factor, by distributing the
clock through an H-tree. This scheme is illustrated for linear, square, and hexagonal arrays in Figure 3, in which heavy lines
represent clock edges and thin lines represent communication edges.



Figure  3:  H-tree  layouts  for  clocking  (a)  linear  arrays,  (b)  square  arrays,  and  (c)  hexagonal  arrays. 

More  precisely,  we  have  the  following  result: 

Lemma 1: For any given layout of bounded aspect ratio, it is possible to run a clock tree such that all nodes in the
original layout are equidistant (with respect to the clock tree) from the root of the tree, and the clock tree takes an area
no more than a constant times the area of the original layout.

By a theoretical result [4] that any rectangular grid can be embedded in a square grid by stretching the edges and the area of the
source  grid  by  at  most  a  constant  factor,  we  have  the  following  theorem: 

Theorem 2: Under the difference model of clock skew, any ideally synchronized systolic array with computation
and communication delay $\Delta$ bounded by a constant can be simulated by a corresponding clocked system operating
with a clock period independent of the size of the array, with no more than a constant factor increase in layout area.
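
The equidistance property behind Lemma 1 can be sketched for the square-array case. The code below is a minimal illustration under stated assumptions: the array is $2^k \times 2^k$, cells are unit squares centered at half-integer coordinates, and the function name and recursive construction are chosen for this example rather than taken from the paper.

```python
def h_tree_leaves(k: int):
    """Build an H-tree clock over a 2**k x 2**k array of unit cells.

    Returns a list of (x, y, wire_length_from_root), one entry per leaf.
    The tree branches alternately in the horizontal and vertical
    directions, and the arm length halves after each vertical branching.
    """
    n = 2 ** k
    leaves = []

    def grow(x, y, arm, horizontal, remaining, dist):
        if remaining == 0:
            leaves.append((x, y, dist))
            return
        next_arm = arm if horizontal else arm / 2
        for sign in (-1, 1):
            nx = x + sign * arm if horizontal else x
            ny = y if horizontal else y + sign * arm
            grow(nx, ny, next_arm, not horizontal, remaining - 1, dist + arm)

    grow(n / 2, n / 2, n / 4, True, 2 * k, 0.0)
    return leaves


if __name__ == "__main__":
    leaves = h_tree_leaves(3)
    print(len(leaves), {dist for _, _, dist in leaves})
```

Every leaf lies at the same wire distance from the root, so the difference $d$ of (A9) is zero for every pair of cells.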

5.  Clocking  under  the  Summation  Model 

This  section  relaxes  the  assumption  of  the  previous  section  by  using  the  summation  model  rather  than  the  difference  model  for 
clock skews. The clock skew between two nodes of CLK, with respect to a given layout, is related to the (physical) length of the
path  on  CLK  that  connects  the  two  nodes.  Note  that  because  the  summation  model  is  weaker  than  the  difference  model,  any 
clocking  scheme  working  under  the  summation  model  must  also  work  under  the  difference  model.  The  reverse  of  the  statement  is 
not  true,  however.  For  example,  the  clocking  scheme  illustrated  in  Figure  3(a)  for  linear  arrays  may  not  work  under  the  summation 
model,  since  two  communicating  cells  (such  as  the  two  middle  cells  on  the  left  side  of  the  layout)  could  be  connected  by  a  path  on 
CLK  whose  length  can  be  arbitrarily  large  as  the  size  of  the  array  grows.  In  the  following  we  give  another  clocking  scheme  for 
linear  arrays  that  works  even  under  the  summation  model  for  clock  skew;  in  addition,  we  show  that  it  is  impossible,  under  this 
model,  to  clock  a  two-dimensional  array  in  time  independent  of  its  size.  In  this  sense,  linear  arrays  are  especially  suitable  for 
clocked  implementation. 

5.1.  Clocking  one-dimensional  systolic  arrays 

Given  any  ideally  synchronized  one-dimensional  systolic  array  (Figure  4  (a)),  we  propose  a  corresponding  clocked  array 
(Figure  4  (b))  obtained  by  running  a  clock  wire  along  the  length  of  the  one-dimensional  array. 



Figure  4:  (a)  Ideally  synchronized  one-dimensional  systolic  array  and  (b)  corresponding  clocked  array. 

By  (A10)  the  maximum  clock  skew  between  any  two  neighbors  is  bounded  above  by  a  constant  g(s),  where  s  is  the  center-to-center 
distance  between  neighboring  cells.  Thus  we  have  the  following  result: 

Theorem 3: Under the summation model of clock skew, any ideally synchronized one-dimensional systolic array
with computation and communication delay $\Delta$ bounded by a constant can be simulated by a corresponding clocked
system, as illustrated in Figure 4, operating at a clock period independent of the size of the array.
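
The reason a tree-shaped clock fails for linear arrays under the summation model, while a clock wire run along the array succeeds, can be seen from the path lengths involved. The sketch below assumes a balanced binary clock tree laid out in a straight line over $2^k$ unit-pitch cells; this is a simplification for illustration and not the layout of Figure 3(a), and the function name is hypothetical.

```python
def tree_path_between_neighbors(k: int, i: int) -> int:
    """Physical length of the clock-tree path between cells i and i + 1 of a
    2**k-cell linear array clocked by a balanced binary tree.

    A subtree spanning `size` cells has its root at the center of its span,
    so each of its leaves is (size - 1) / 2 away from that root.
    """
    n = 2 ** k
    if not 0 <= i < n - 1:
        raise ValueError("i must index a cell that has a right neighbor")
    lo, hi = 0, n  # cells [lo, hi) covered by the current subtree
    while True:
        mid = (lo + hi) // 2
        if i + 1 < mid:      # both cells in the left half
            hi = mid
        elif i >= mid:       # both cells in the right half
            lo = mid
        else:                # the subtree root is the nearest common ancestor
            return (hi - lo) - 1


if __name__ == "__main__":
    for k in (2, 4, 8, 10):
        middle = 2 ** (k - 1) - 1
        # with a clock wire along the array, the neighbor distance is always 1
        print(2 ** k, tree_path_between_neighbors(k, middle), 1)
```

In this simplified layout the two middle cells are separated by a tree path of length $n - 1$, so by (A11) their skew can grow with the array, whereas adjacent cells on a clock wire along the array remain one cell pitch apart.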

Skew  between  the  host  and  the  ends  of  the  array  can  be  handled  similarly  by  folding  the  array  in  the  middle  (Figure  5),  and  the 
array  can  be  laid  out  with  any  desired  aspect  ratio  by  using  a  comb-shaped  layout  (Figure  6). 

With  the  clocking  schemes  illustrated,  we  see  that  the  clock  period  for  any  one-dimensional  systolic  array  can  be  made 
independent  of  the  size  of  the  array.  As  a  result,  the  clocked  array  may  be  extended  to  contain  any  number  of  cells  using  the  same 
clocked  cell  design.  Therefore,  we  can  say  that  these  clocked  schemes  are  most  suitable  for  synchronizing  one-dimensional  arrays 
due to their simplicity, modularity and expandability. Note that one-dimensional arrays are especially important in practice
because of their wide applicability and their bounded I/O requirements.

5.2.  A  lower  bound  result  on  clock  skew 

We show here that the result of Theorem 3 for the one-dimensional array cannot be extended to two-dimensional structures.
Consider any layout of an $n \times n$ array and a global clock tree CLK whose nodes include all cells of the array. Let $\delta$ be the maximum
clock skew between two communicating cells of the array. We want to prove that $\delta$ cannot be bounded above by any constant
independent of $n$. We use the following well-known result:

Figure 5: Array folded to bound skew with host.

Figure 6: Comb layout.

Lemma 4: To bisect an $n \times n$ mesh-connected graph, at least $c \cdot n$ edges have to be removed, where $c > 0$ is a constant
independent of $n$.

Bisecting a graph means partitioning the graph into two subgraphs, each containing about half of the nodes of the original graph.
Here for the $n \times n$ mesh-connected graph we assume that none of the subgraphs contain more than $(23/30) \cdot n^2$ nodes. We also use
the following trivial but useful lemma without giving a proof.

Lemma  5:  For  any  subset  M  of  nodes  of  a  binary  tree,  there  exists  an  edge  of  the  tree  such  that  its  removal  from  the 
tree  will  result  in  two  disjoint  subtrees,  each  having  no  more  than  two-thirds  of  the  nodes  in  M. 
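
The kind of edge that Lemma 5 refers to can be found by a simple search. The sketch below is an illustration only: the edge-list representation and the function name are assumptions, and the code reports the most balanced edge for a given tree rather than proving the two-thirds bound.

```python
from collections import defaultdict


def best_separating_edge(edges, marked):
    """Find the tree edge whose removal splits the marked nodes most evenly.

    `edges` is a list of (a, b) pairs forming a tree and `marked` is the
    subset M of nodes. Returns (edge, larger_side_count, total_marked).
    """
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    root = edges[0][0]
    parent = {root: None}
    order = [root]
    for node in order:  # breadth-first traversal; `order` grows as we go
        for nb in adj[node]:
            if nb not in parent:
                parent[nb] = node
                order.append(nb)

    # count marked nodes in the subtree below each node
    count = {v: (1 if v in marked else 0) for v in order}
    for v in reversed(order):
        if parent[v] is not None:
            count[parent[v]] += count[v]

    total = count[root]
    best = None
    for v in order:
        if parent[v] is None:
            continue
        larger = max(count[v], total - count[v])
        if best is None or larger < best[1]:
            best = ((parent[v], v), larger)
    return best[0], best[1], total


if __name__ == "__main__":
    # complete binary tree with 15 nodes, heap-indexed; M is the set of leaves
    edges = [(i, 2 * i) for i in range(1, 8)] + [(i, 2 * i + 1) for i in range(1, 8)]
    print(best_separating_edge(edges, set(range(8, 16))))
```

In the proof that follows, the marked set is the set of array cells, which are only some of the nodes of the clock tree.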

The $n^2$ cells of the $n \times n$ array form a subset of nodes of CLK. By Lemma 5 we know that by removing a single edge, CLK can be
partitioned into two disjoint subtrees such that each subtree has no more than $(2/3) \cdot n^2$ cells. Denote by $A$ and $B$ the sets of cells in
the two subtrees. Let $u$ be the root of the subtree that contains cells in $A$. Consider the circle centered at $u$ and with radius $\delta/\beta$,
where $\beta$ is defined in (A11). If there are $\ge (1/10) \cdot n^2$ cells inside the circle, then by (A2)

$$\pi (\delta/\beta)^2 \ge n^2/10, \quad \text{or} \quad \delta = \Omega(n),$$

and thus $\delta$ cannot be bounded above by any constant independent of $n$. Suppose now that there are $< (1/10) \cdot n^2$ cells inside the
circle. Note that any of those cells in $A$ which are outside the circle cannot reach any cell in $B$ by a path on CLK with (physical)
length $\le \delta/\beta$. Thus these cells cannot have any communicating cells in $B$ (with respect to the $n \times n$ array), since by (A11) the clock
skew between these cells and any cell in $B$ is $> \beta \cdot \delta/\beta = \delta$ and the clock skew between any two neighboring cells is assumed to be
$\le \delta$. These sets are illustrated in Figure 7(a). Let $\bar{A}$ be the union of $A$ and the set of cells in the circle, and $\bar{B}$ be $B$ minus the set of
cells in the circle. See Figure 7(b). Then $\bar{A}$ and $\bar{B}$ form a partition of the $n \times n$ array, and each of them has no more than $(1/10) \cdot n^2$
$+ (2/3) \cdot n^2 = (23/30) \cdot n^2$ cells. From Figure 7(b), we see that any edge in the $n \times n$ array connecting a cell in $\bar{A}$ and a cell in $\bar{B}$ must
cross the boundary of the circle. Since the length of the boundary is $2\pi\delta/\beta$, by (A3) $\bar{A}$ and $\bar{B}$ are connected by no more than
$2\pi\delta/\beta$ edges. By Lemma 4 we have $2\pi\delta/\beta \ge c \cdot n$, or

$$\delta = \Omega(n).$$

Therefore as $n$ increases, $\delta$ grows at least at the rate of $n$; we see that it is impossible to run a global clock for the $n \times n$ array such
that the maximum clock skew $\delta$ between communicating cells will be bounded above by a constant, independent of $n$.
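
The two cases of the proof each give an explicit linear lower bound on the skew, and the weaker of the two always holds. The sketch below evaluates that bound; the function name is hypothetical, and the values used for the constants of (A11) and Lemma 4 are placeholders, since the paper does not give them.

```python
import math


def skew_lower_bound(n: int, beta: float, c: float) -> float:
    """Lower bound on the maximum skew between communicating cells of an
    n x n array under the summation model.

    Case 1 (at least n**2 / 10 cells in the circle):
        pi * (delta / beta)**2 >= n**2 / 10
    Case 2 (fewer cells in the circle):
        2 * pi * delta / beta >= c * n
    One of the two cases must apply, so the smaller bound always holds.
    """
    case_many_cells = beta * n / math.sqrt(10 * math.pi)
    case_few_cells = c * beta * n / (2 * math.pi)
    return min(case_many_cells, case_few_cells)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # beta = 1 and c = 1 are placeholder constants, for illustration only
        print(n, round(skew_lower_bound(n, beta=1.0, c=1.0), 2))
```

Whatever the constants are, the bound scales linearly with $n$.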



Figure  7:  (a)  original  partition  and  (b)  new  partition  of  the  communication  graph. 

The above proof for two-dimensional mesh graphs can be generalized to deal with other classes of graphs. For the
generalization, we need to define the minimum bisection width of a graph [6], which is the number of edge cuts needed to bisect the
graph. For example, by Lemma 4 the minimum bisection width of an $n \times n$ mesh-connected graph is $\Omega(n)$. We have the following
general result:

Theorem 6: Suppose that the minimum bisection width of an $N$-node graph is $\Omega(W(N))$ and $W(N) = O(\sqrt{N})$.
Then

$$\delta = \Omega(W(N)).$$

Since under the summation model of clock skew two-dimensional $n \times n$ systolic arrays cannot be efficiently implemented by
clocked  controls,  their  implementation  should  be  assisted  by  some  self-timed  scheme  as  discussed  in  the  next  section. 

6.  Hybrid  Synchronization 

In the absence of the invariance condition (A8), provisions must be made to ensure that a clock event does not "catch up with" a
previous  event.  This  requires  that  each  clock  buffer  refrain  from  passing  on  an  event  until  the  processing  of  the  previous  event  has 
been  acknowledged.  In  order  to  implement  this  constraint,  we  can  essentially  replace  the  buffers  of  the  previous  sections  with  a 
handshaking  network  which  operates  on  clock  events. 

In this approach, we break up the layout into bounded-size segments, and provide each segment with a local clock distribution
node. The clock distribution nodes employ a handshaking protocol to pass clock events among themselves. Given assumptions
about the maximum delay of a computation node and its wires and the maximum delay for a handshake transaction in the clock
distribution network, we can clock the cells in each neighborhood in constant time. As before, we balance the delay within each
element with the wire delays between elements. This structure is illustrated in Figure 8, in which the heavy lines and black boxes
represent the self-timed synchronization network, and the narrow lines represent local clock distribution to the cells near each
synchronizing element.


Figure  8:  Hybrid  synchronization  scheme. 

This  provides  the  performance  of  a  self-timed  system  by  making  all  synchronization  paths  local,  while  isolating  the  self-timed 
logic  to  a  small  subsystem  and  allowing  the  computational  elements  to  be  designed  as  if  the  entire  system  were  globally  clocked. 
The  hybrid  approach  has  the  additional  advantage  that  a  single  synchronization  design  can  be  used  for  many  different  structures. 
This  simplification  of  the  usual  self-timed  scheme  is  made  possible  by  the  fact  that  we  are  willing  to  assume  a  maximum  delay  for 
the  computational  elements;  this  is  the  same  assumption  made  in  ordinary  clocked  schemes.  Note  that  we  are  willing  to  let  the 
entire array operate at worst-case cell speed, since even a fully self-timed array would usually wind up operating at that speed
regardless. 
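
The behavior of the handshaking clock distribution nodes can be sketched with a small timing model. Everything in the code below is an assumption made for illustration: the nodes are arranged in a one-dimensional chain rather than the network of Figure 8, and a node issues clock event k only when its predecessor has delivered event k, its own segment has finished event k - 1, and its successor has acknowledged event k - 1.

```python
def simulate_handshake_chain(num_nodes: int, num_events: int,
                             local_delay: float, link_delay: float):
    """Return fire[i][k], the time at which node i issues clock event k.

    local_delay is the assumed maximum time for a segment to process one
    event, and link_delay is the assumed delay of one hop in the
    synchronization network (for an event or for an acknowledgment).
    """
    fire = [[0.0] * num_events for _ in range(num_nodes)]
    for k in range(num_events):
        for i in range(num_nodes):
            ready = 0.0
            if i > 0:  # event k must arrive from the predecessor
                ready = max(ready, fire[i - 1][k] + link_delay)
            if k > 0:
                # the node's own segment must have finished event k - 1
                ready = max(ready, fire[i][k - 1] + local_delay)
                if i + 1 < num_nodes:
                    # the successor must have acknowledged event k - 1
                    ready = max(ready,
                                fire[i + 1][k - 1] + local_delay + link_delay)
            fire[i][k] = ready
    return fire


if __name__ == "__main__":
    for size in (8, 100):
        fire = simulate_handshake_chain(size, 5, local_delay=3.0, link_delay=1.0)
        print(size, fire[0][4] - fire[0][3], fire[size - 1][4])
```

In this model the interval between successive events at a node depends only on the local and link delays, not on the length of the chain, while the time for an event to reach the far end still grows with the chain.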

7.  Concluding  Remarks 

We  have  described  a  series  of  models  in  which  synchronization  schemes  can  be  studied,  and  have  indicated  some  of  the 
implications  of  these  models.  Future  work  should  include  refinement  of  the  models  and  some  quantification  of  when  they  apply  to 
real systems, as well as further work on their implications. This paper has concentrated on the interaction of clock skew models
with  the  communication  structure  of  arrays  with  bounded  communication  delay;  future  work  should  also  examine  cases  where 
asymptotically  growing  delays  occur. 

One interesting such case is that where the communication graph COMM, neglecting edge directions, is a binary tree. It has
been shown that a planar layout of a tree with $N$ nodes of unit area must have an edge of length $\Omega(\sqrt{N} / \log N)$. Under the
summation model of Section 5, then, if we make the additional assumption that communication delays are proportional to path
length, a tree may be clocked at no loss in asymptotic performance simply by distributing clock events along the data paths.


Furthermore, if COMM is acyclic, as in the tree machine algorithms described in a paper by Bentley and Kung, and the ratio
between  lengths  (in  the  layout)  of  any  two  edges  at  the  same  level  in  the  graph  is  bounded,  pipeline  registers  can  be  added  on  the 
long  edges,  with  the  same  number  of  registers  on  all  of  the  edges  in  a  given  level.  This  makes  all  wires  have  bounded  length,  thus 
causing  the  time  needed  for  a  cell  to  operate  and  pass  on  its  results  to  be  independent  of  the  size  of  the  tree.  Adding  the  registers 
increases  the  layout  area  by  at  most  a  constant  factor,  since  they  in  effect  just  make  wires  thicker.  For  example,  an  H-tree  layout 
has this property, and allows a tree machine of $N$ nodes to be laid out in area $O(N)$ with delay through the tree of $O(\sqrt{N})$ and
constant pipeline interval.

Abstract


This  paper  compares  timing  and  other  aspects  of  a  synchronous  and  asynchronous 
square  array  of  processing  elements,  fabricated  by  means  of  VLSI  technology.  Timing  models 
are  developed  for  interprocessor  communications  and  data  transfer  for  both  cases.  The 
synchronous  timing  model  emphasizes  the  clock  skew  phenomenon,  and  enables  derivation  of  the 
dependence of the global clock period on the size of the array. This $O(N^3)$ dependence,
along with the limited flexibility with regard to programmability and extendability, calls
for a serious consideration of the asynchronous configuration. A self-timed (asynchronous)
model,  based  on  the  concept  of  wavefront  oriented  propagation  of  computation,  is  presented 
as  an  attractive  alternative  to  the  synchronous  scheme.  Some  potential  hazards,  unique  to 
the  asynchronous  model  presented,  and  their  solutions  are  also  noted. 


1. Introduction


The  availability  of  low  cost,  fast  VLSI  (Very  Large  Scale  Integration)  devices 
promises  the  practice  of  cost-effective,  high  speed,  parallel  processing  of  large  volumes  of 
data.  The  traditional  design  of  parallel  computers  is  becoming  unsuitable  for  the  design  of 
highly  concurrent  VLSI  computing  processors.  It  usually  suffers  from  heavy  supervisory 
overhead  incurred  by  synchronization,  communication  and  scheduling  tasks,  which  severely 
hamper  the  throughput  rate  which  is  critical  to  real-time  signal  processing.  In  fact,  these 
are  the  key  barriers  inherent  in  very  large  scale  computing  structure  design.  Moreover, 
though VLSI provides the capability of implementing a large array of processors on one chip,
it  imposes  its  own  constraints  on  the  system.  Large  design  and  layout  costs  [1]  suggest  the 
utilization  of  a  repetitive  modular  structure.  In  addition,  communication,  which  costs  the 
most  in  VLSI  chips,  in  terms  of  area,  time  and  energy,  has  to  be  restricted  (to  localized 
communication). In general, highly concurrent systems require this locality property in
order  to  reduce  interdependence  and  ensuing  waiting  delays  that  result  from  excessive 
communication  [2].  Moreover,  this  locality  constraint  may  further  render  the  utilization  of 
centralized  control  and  global  synchronization  less  appealing.  As  a  result,  the  use  of 
asynchronous,  distributed  control  and  localized  data  flow  may  become  a  more  effective 
approach  to  the  design  of  very  large  scale,  highly  concurrent  computing  structures.  This 
paper  will  attempt  to  present  some  key  factors  affecting  the  tradeoff  between  the  two  timing 
schemes.

The  timing  framework  is  a  very  critical  issue  in  designing  the  system,  especially 
when  one  considers  large  scale  computational  tasks.  Two  opposite  timing  schemes  come  to 
mind,  namely  the  Synchronous  and  the  Asynchronous  timing  approaches.  In  the  synchronous 
scheme,  there  is  a  global  clock  network  which  distributes  the  clocking  signals  over  the 
entire chip. The global clock beats out the rhythm to which all the processing elements in
the  array  execute  their  sequential  tasks.  All  the  PEs  operate  in  unison,  all  performing  the 
same,  identical,  operation.  In  contrast,  the  asynchronous  scheme  involves  no  global  clock, 
and  information  transfer  is  by  mutual  convenience  and  agreement  between  each  processing 
element  and  its  immediate  neighbors.  Whenever  the  data  is  available,  the  transmitting  PE 
informs  the  receiver  of  that  fact,  and  the  receiver  accepts  the  data  whenever  it  is 
convenient for it to do so. This scheme can be implemented by means of a simple handshaking
protocol.
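
One possible form of such a protocol is sketched below. The paper does not specify the protocol at this point, so the four-phase request/acknowledge scheme, the `Link` class, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Link:
    """Hypothetical one-directional link between two neighboring PEs."""
    data: Optional[Any] = None
    request: bool = False      # raised by the sender: data is available
    acknowledge: bool = False  # raised by the receiver: data has been taken


def sender_step(link: Link, outbox: list) -> None:
    if not link.request and not link.acknowledge and outbox:
        link.data = outbox.pop(0)  # place the data, then signal availability
        link.request = True
    elif link.request and link.acknowledge:
        link.request = False       # receiver has the data; withdraw request


def receiver_step(link: Link, inbox: list, busy: bool) -> None:
    if busy:
        return                     # receiver acts only when convenient
    if link.request and not link.acknowledge:
        inbox.append(link.data)
        link.acknowledge = True
    elif not link.request and link.acknowledge:
        link.acknowledge = False   # handshake complete; link is idle again


def transfer(items, receiver_busy_pattern):
    """Move all items across one link; the receiver is busy on the steps
    where the (cyclically repeated) pattern is True."""
    if all(receiver_busy_pattern):
        raise ValueError("receiver must be free at least once per cycle")
    link, outbox, inbox, step = Link(), list(items), [], 0
    while outbox or link.request or link.acknowledge:
        busy = receiver_busy_pattern[step % len(receiver_busy_pattern)]
        sender_step(link, outbox)
        receiver_step(link, inbox, busy)
        step += 1
    return inbox


if __name__ == "__main__":
    print(transfer(["a", "b", "c"], [True, True, False]))
```

No data item is lost or duplicated regardless of how often the receiver is busy, and no global clock is involved in the exchange.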

A  proper  timing  model,  which  includes  all  lines  of  communication  between  adjacent 
processing  elements,  is  essential  for  timing  analysis  and  comparison  between  the  synchronous 
and the asynchronous approaches. Numerous studies on the tradeoffs between the two general
schemes have been carried out. Seitz [5], [2, chap. 7] observes that, for large scale
systems, synchronous timing approaches pose increasing difficulties. An interesting
comparison of the two timing schemes is carried out by Franklin [6].
Franklin's models represent multiple VLSI chip interconnection schemes. In-depth
studies of the physical processes involved in transfer of information between adjacent
transistors in neighbouring processing elements on the chip, which are applicable to VLSI
oriented timing discussions, have also been carried out [7].

A  general  representation  of  all  VLSI  systems  is  to  view  them  as  a  computation  graph. 
The  graph's  nodes  represent  devices,  which  compute  boolean  functions,  and  its  arcs  are 
wires,  which  are  responsible  for  information  transfer  and  distribution  of  power  and  timing 
waveforms.

Several  computing  structures,  which  are  designed  to  meet  the  constraints  and  goals 
of  VLSI  technology,  have  been  suggested.  Since  the  VLSI  technological  constraints  render 
the  general  purpose  array  processor  rather  inefficient,  a  special  class  of  applications, 
i.e.  recursive  and  local  data  dependent  algorithms,  has  become  a  focus  of  discussion  [3,4]. 
Accordingly,  we  shall  limit  ourselves  to  matrix  related  configurations,  similar  to  that  of 
Fig. 1, and we narrow down the structures to two major groups. The systolic array is an
example  of  a  totally  synchronous  system,  and  is  typical  of  the  first  group.  The  second 
group,  consisting  of  asynchronous  timing  scheme  configurations,  is  represented  by  data  flow 
machines.

The Systolic Array, introduced by H. T. Kung and C. E. Leiserson [2, chap. 8.3], was one
of the first systematic attempts to harness VLSI power in the service of computationally
intensive  tasks.  Their  design  consists  of  a  regular  array  of  identical  processors  in  a 
linearly  connected  or  a  hexagonally  mesh-connected  geometry.  The  linear  array  is  suited 
(and  dedicated)  to  matrix-vector  multiplication  and  solution  of  triangular  linear  systems, 
while  the  hex-connected  array  is  used  for  matrix  multiplication  and  LU  decomposition  type 
problems.  Prom  the  timing  point  of  view,  the  systolic  array  is  wholly  synchronous,  and 
requires  global  clock  distribution,  therefore  suffering  the  inherent  penalty  of  all 
synchronous  systems,  that  of  being  able  to  clock  at  only  the  rate  determined  by  the  slowest 
element  in  the  array  and  by  the  system  clock  skew. 

Another feature of the systolic array is its total dedication to implementing a
given algorithm. The array is not programmable, and each algorithm requires a separate and
distinct array configuration. Also, because of the strict synchronized timing, all of the
processors, except perhaps some special peripheral elements, must be performing the same
task in unison. There is no room for multi-tasking, even if the tasks are serial and not
interwoven one into the other. The systolic array concept, therefore, involves an
inflexibility which might reduce its scope of applications.

The  Data  Flow  Machine  is  asynchronous  and  consists  of  numerous  processing  elements 
(PEs)  which  can  operate  independently.  A  main  feature  of  the  data  flow  machine  is  that  an 
instruction  is  ready  for  execution  when  its  operands  are  available.  There  is  no  concept  of 
control  flow,  nor  is  there  a  program  location  counter.  A  consequence  of  this  philosophy  is 
that  many  instructions  may  be  available  for  execution  at  once,  and  the  rate  of  throughput 
will  depend  on  the  availability  of  processing  resources.  A  weakness  of  the  data  flow 
concept  lies  in  the  fact  that  much  managing  and  bookkeeping  are  required  to  allocate  the 
resources efficiently. The decomposition process, for example, which decomposes a program
into its concurrent components, is time consuming, requires an extensive operating system
and, above all, demands that each PE be aware of the global system's resources: how many PEs
are available, the sizes of their respective local memories and their workload. This is the
penalty of pursuing general purpose computing features while attempting efficient
parallelism and concurrency. It also limits the ability of the general purpose data flow
machine to be implemented in VLSI.


FETCH     A(k);                   (* From the left *)
FETCH     B(k);                   (* From above *)
MULTIPLY  D    := A(k)*B(k);
ADD       C(k) := C(k-1) + D;
FLOW      A(k);                   (* To the right *)
FLOW      B(k);                   (* Downward *)


Fig. 1: Square Array of Processors

Fig. 2: Operation Sequence of Matrix Multiplication


As matrix operations form the backbone for the parallel array structures dealt with
in this paper, we will, throughout this paper, employ the basic matrix multiplication,
C := A*B, as a bench test for evaluation and comparison of the timing characteristics of our
synchronous and asynchronous models. The nature of matrix multiplication, as implemented
via the wavefront computation propagation concept, is described below. It suffices, at this
point, to provide the basic operations executed by each PE within the processor array. In
the k-th recursion, a PE executes: C(k) := C(k-1) + A(k)*B(k), where C(k-1) is the result
of the (k-1)-th recursion and already resides within the PE, while A(k) and B(k) are supplied
by the PE's immediate neighbors, as a preliminary to the actual accumulating product operation.
The sequence of operations is, therefore, as shown in Fig. 2. Some of the operations in the
matrix multiplication sequence can be merged, through pipelining and/or concurrency, but all
must be carried out as part of the recursive task.
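
The per-PE recursion can be illustrated with a short Python sketch. It is only a minimal illustration of the operation sequence, not the authors' implementation; the function names `pe_recursion` and `accumulate` and the list-based operands are assumptions introduced here.

```python
def pe_recursion(c_prev, a_k, b_k):
    """One recursion of a single PE, following the FETCH/MULTIPLY/ADD/FLOW order.

    c_prev is C(k-1), already held inside the PE; a_k arrives from the
    left neighbor and b_k from the neighbor above (hypothetical arguments).
    Returns the new C(k) plus the operands passed on to the right and below.
    """
    d = a_k * b_k          # MULTIPLY  D := A(k)*B(k)
    c_k = c_prev + d       # ADD       C(k) := C(k-1) + D
    return c_k, a_k, b_k   # FLOW A(k) to the right, B(k) downward


def accumulate(a_row, b_col):
    """Run one recursion per operand pair in a single PE (an inner product)."""
    c = 0
    for a_k, b_k in zip(a_row, b_col):
        c, _, _ = pe_recursion(c, a_k, b_k)
    return c
```

After N recursions the value held in a PE is the inner product of one row of A with one column of B, which is exactly one entry of C.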


2.     Timing Analysis of the Synchronous Configuration


In  the  synchronous  scheme,  we  assume  a  global  clock  distribution  network, 
distributing  the  clocking  signals  over  the  entire  VLSI  chip.  In  the  asynchronous  approach, 
information  transfer  is  implemented  by  means  of  a  handshaking  protocol.  The  nature  of  the 
handshaking  protocol  varies  slightly  from  one  system  to  the  next.  We  will  base  our 
presentation  on  a  particular  "wavefront"  oriented  computation  scheme,  described  in  detail 
below. 

The  essential  difference  between  a  synchronous  system  and  an  asynchronous  one  is 
their  timing;  therefore,  the  timing  issues  constitute  the  major  factor  in  defining  the 
performance  evaluation  criteria  of  the  processing  array.  In  addition  to  the  delay  due  to 
timing  considerations,  the  timing  constraints  also  commonly  impose  constraints  on  other 
system  parameters,  such  as  area  allocation,  layout  rules,  power  consumption  and,  of  course, 
overall  performance  of  the  array. 

Timing  aspects  in  the  array  environment  can,  in  general,  be  divided  into  three 
groups:  logic  transition  factors,  including  propagation  delays  within  the  basic  logic 
elements  (gates,  flip  flops,  registers,  etc.),  time  involved  in  charging  the  capacitances  of 
the  data  and  control  distribution  networks  and  time  involved  in  charging  the  clock 
distribution  network.  The  first  two  groups  are  dominant  in  the  asynchronous  systems, 
whereas  all  three  groups  play  a  role  in  the  synchronous  configurations,  especially  the  first 
and the third. For the purpose of this paper, we will assume that the time factors due to
the basic logic elements, from the processing element down to individual gates, are of the
same order for both synchronous and asynchronous configurations. Our main target will,
therefore, be the timing aspects that differentiate the two schemes: control and clock
timing involved in interprocessor communications and data transfer.


2.1     Clock  Skew  Phenomena 


In  the  synchronous  global  clock  distribution  network,  one  encounters  the  "Clock 
Skew"  phenomenon,  which  arises  due  to  three  factors.  The  first  source  of  clock  skew  is  the 
RC  of  the  global  distribution  line.  The  second  is  due  to  unequal  clock  paths  to  various  PEs 
in  the  array.  Both  factors  are  a  function  of  the  layout  scheme  of  the  chip.  The  third 
factor  contributing  to  clock  skew  is  the  variance  of  values  of  the  gate  threshold  voltage, 
Vt, of the PE gate which receives and generates the global clock signal to the interior of
the  PE,  thus  serving  as  a  buffer  between  the  global  clock  distribution  network  and  the  local 
clock  distribution  paths.  All  three  factors  are  dependent  on  the  fabrication  process  [6]. 
The  following  analysis  describes  the  essence  of  the  clock  skew  problem. 

The timing analysis associated with the propagation of a signal along a conducting
path between two MOS switches within a chip is similar to that of the transmission lines one
encounters in power systems. The only difference is that the line inductance is negligible,
as the rate of change of currents is negligible. A detailed analysis of solving the partial
differential equations can be found in [7]. An important conclusion reached there is that
the line capacitance plays a dominant role in the line delay. In fact, when the signal of
interest is the clocking signal, this delay will, in many cases, create disastrous
synchronization problems, particularly if the clock traverses a large distance on the chip.
This could happen quite commonly in a large VLSI chip.

As  crucial  as  the  capacitive  effect  of  the  clock  transmission  line  is  the  resistance 
through  which  that  capacitance  must  charge.  It,  too,  is  dependent  on  the  length  of  the 
line.  Due  to  constraints  involved  in  the  layout  of  the  VLSI  chip,  a  major  factor  in  the 
line  resistance  is  the  distribution  of  the  material  of  which  the  line  is  made.  As  the 
resistance  of  diffusion  is  of  the  order  of  100  times  that  of  metal,  the  length  of  diffusion 
paths in the clock line is a predominant factor in that line's time constant.

The equation which governs the exponential waveform of the clocking signal on the
line is:

The time t, when the gate switches its logic state, is dependent on both RC and the gate
threshold voltage, Vt. This is best shown by the graph of line voltage versus time in
Fig. 3. This time is dependent on the line length, and is closely related to the ratio of
diffusion and metal of the line.

The points marked on Fig. 3 clearly indicate the total clock skew, dt, which
consists of the skew contributions due to value variations of both Vt and RC. The
uncertainty in Vt is a fabrication phenomenon, e.g. Vt may be 20% higher or lower than the
typical threshold voltage, Vt=2.5 volts, as given in manufacturers' data sheets. Fig. 3
also yields the equation:

$$
dt = \text{clock skew} = t_2 - t_1 = RC(\max)\,\ln[V_t(\max)] - RC(\min)\,\ln[V_t(\min)] \tag{2}
$$

This establishes the clock skew as a function of R, C and Vt.
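
The dependence of the skew on RC and Vt can be made concrete with a small numerical sketch. It assumes the textbook first-order charging law V(t) = Vdd*(1 - exp(-t/RC)), so its closed form differs in appearance from equation (2), which is written in the authors' own normalization. The supply voltage `vdd`, the RC values and the function names are assumptions introduced here; only the Vt = 2.5 V +/- 20% spread comes from the text.

```python
import math


def switch_time(rc, vt, vdd=5.0):
    """Time for an RC-charged line to reach the threshold vt.

    Assumes V(t) = vdd * (1 - exp(-t / rc)); vdd = 5 V is an assumed
    supply voltage, not a value taken from the text.
    """
    return -rc * math.log(1.0 - vt / vdd)


def clock_skew(rc_min, rc_max, vt_min, vt_max, vdd=5.0):
    """Worst-case skew dt = t2 - t1 between the slowest and fastest gate."""
    t1 = switch_time(rc_min, vt_min, vdd)   # earliest switching instant
    t2 = switch_time(rc_max, vt_max, vdd)   # latest switching instant
    return t2 - t1


# Vt = 2.5 V +/- 20 % as quoted in the text; the RC values (in ns) are hypothetical.
skew = clock_skew(rc_min=1.0, rc_max=1.5, vt_min=2.0, vt_max=3.0)
```

Under this assumed model the skew vanishes only when both RC and Vt are identical at the two gates; a spread in either one is enough to produce a nonzero dt.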


2.2     A  Timing  Model  for  a  Synchronous  Array 


H-Tree Clocking Distribution.  The clock skew due to unequal clock path lengths to
the PEs may become potentially hazardous. It is, therefore, very desirable to eliminate
this skew contributor. To this end, we assume that the array clocking network is of an
H-tree nature [6]. This may be implemented by placing the global clock generator at the
root of a binary distribution tree. All the processing elements are at the various levels
of the tree as children of preceding nodes (see Fig. 4). Every node represents a processor.
This ensures the equal lengths of the clock paths to all the processors. Thus, for the
purpose of reducing clock skew, this clock distribution scheme appears to be optimal. The
clock skew between adjacent PEs will, therefore, be determined solely by eq. (2).


Fig. 3: Clock Skew Timing Diagram (voltage vs. time for two line lengths, with Vt(max), Vt(min) and dt marked)

Fig. 4: H-Tree Clock Distribution Network

Timing Analysis.  The complete circuit analysis of the clock distribution network
is rather involved, so we shall only summarize the final results here, with the details
elaborated in Appendix A. Let L be the dimension of a PE. Then the array dimension will be
N*L. By appropriately modelling the H-tree distribution network as a binary tree
configuration of RC branches (see Appendix A), it can be shown that the time constants, the
rise time and the clock skew associated with the clocking signal distributed to the PEs are
of O(N**3). This is corroborated by the simulation results provided below. These results
place a severe restriction on the ability to generate a global, synchronous clock signal in
a system with large N. It should also be noted that these conclusions are based on a full
and homogeneous binary distribution network. Implementing a synchronous array with an
incomplete H-tree distribution, or non-homogeneous clock signal paths, will add to this
clock skew.

For the purpose of simulating the distribution network, a value of L = 0.1 mm was
chosen. Typical values for the various physical parameters were applied. Simulation was
carried out via the SPICE program, and relevant data extracted from the timing graphs. The
results are depicted in Fig. 5, which graphs the clock skew as a function of N and of the
conducting material distribution factor, r. To further verify the results, a two-pole
approximation of the distribution network, as described in Appendix A, was executed. Here,
too, the O(N**3) dependency is evident.

56 / SPIE Vol. 341 Real Time Signal Processing V (1982)

System Considerations.  Several system level assumptions are made with regard to
the synchronous array in order to enable a realistic evaluation of its performance. First,
the algorithm implemented by the array consists of identical recursions, where each
recursion involves several states of the PE executing it. For example, we assume that
addition requires K{add} clock cycles, multiplication K{mult} cycles, data flow K{flow}
cycles, etc. Thus, the matrix multiplication algorithm, described above, which calls for a
multiplication, an addition and two flow/fetch tasks would require
K{mult}+K{add}+2*K{flow} clock cycles. Note that, whereas in the asynchronous scheme flow
and fetch are separate tasks, in a synchronous array with a two phase clock they may be
combined into one task that involves a simultaneous flow-to-left and fetch-from-right or
flow-down and fetch-from-up. This combination is clearly an advantage and is, therefore,
taken into consideration in the synchronous model.


Fig. 5: Clock Skew vs. N

Fig. 6: Efficiency vs. N

A single, independent PE allows operation with a nominal minimal clock pulse period
of T(pe). To perform the tasks of one recursion, the processor would require an execution
time of [K{mult}+K{add}+2*K{flow}]*T(pe). In contrast, the same PE within the array will be
clocked at intervals of T(ck), where: T(ck) - T(pe) > T(skew). We introduce an efficiency
factor, EFF, which describes the array's performance efficiency based on its timing
capabilities:

$$
\mathrm{EFF} = \frac{\text{independent PE execution time of one recursion}}{\text{PE execution time of one recursion within the array}} \tag{3}
$$

Clearly, in the case of the synchronous configuration, EFF = T(pe)/T(ck). Also very evident
is the strong dependency of the efficiency factor on the clock skew. The relationship
between EFF and N is given in Fig. 6, for r=0.8 and T(pe)=20 nsec. It should be noted that
the model presented is general enough to include either a programmable array or an array
dedicated to the algorithm.
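
How quickly the synchronous efficiency degrades with array size can be sketched numerically. The sketch below takes T(ck) = T(pe) + T(skew), the limiting case of the constraint T(ck) - T(pe) > T(skew), so the value it returns is an upper bound on EFF. The cubic skew model follows the O(N**3) result, but the constant `skew_coeff` is a hypothetical value chosen only for illustration, as is the function name.

```python
def sync_efficiency(n, t_pe=20.0, skew_coeff=1e-4):
    """Upper bound on EFF = T(pe)/T(ck) for an N x N synchronous array.

    Uses T(ck) = T(pe) + T(skew) and models T(skew) = skew_coeff * N**3.
    t_pe is in ns (20 ns as in the text); skew_coeff (ns) is a
    hypothetical constant, not a value taken from the simulations.
    """
    t_skew = skew_coeff * n ** 3
    t_ck = t_pe + t_skew
    return t_pe / t_ck


for n in (8, 16, 32, 64):
    print(n, round(sync_efficiency(n), 3))
```

Because the skew term grows with the cube of N, doubling the array dimension multiplies it by eight, so the efficiency falls off sharply once the skew becomes comparable to T(pe).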


3.     Asynchronous  Configuration 


The  problem  of  clock  skew  in  the  synchronous  parallel  processor  array  configuration 
evidently  calls  for  serious  consideration  of  the  asynchronous  alternative.  Systems  which 
are  classified  as  asynchronous,  self-timing  or  data  flow  do  not  suffer  from  the  clock  skew 
problem. We, therefore, turn our attention to the asynchronous timing scheme.

It has been shown that Data Flow Machines suffer from several key disadvantages, due
to their general-purpose nature. The solution hinges upon taking advantage of special data
structures imposed by the class of algorithms considered. More precisely, for the matrix
operation class, we should address the question of whether it is possible to generalize the
synchronous (systolic) array to cope with the asynchronous (self-timed) computation.

The  answer  to  this  query  is  yes.  However,  the  conceptual  framework  needs  to  be 
changed.  For  this  purpose,  the  so-called  wavefront  notion  becomes  useful  and  essential.  The 
Wavefront  Array  Processor  (WAP)  combines  the  merits  of  both  schemes.  In  other  words,  the 
WAP  can  function  as  a  Systolic  Array  in  one  extreme  and  as  a  Data  Flow  Machine  in  the  other 
extreme. We use the model of the WAP to provide a comparison of the timing schemes.


3.1     Computational  Wavefront 


For the purpose of illustrating the WAP concept, let us again use multiplication as
an example. Let A = {a(ij)}, B = {b(ij)} and C = AxB = {C(ij)} all be NxN matrices. The
matrix A can be decomposed into columns A_i and matrix B into rows B_j and, therefore,

$$
C = A_1 B_1 + A_2 B_2 + \cdots + A_N B_N \tag{4}
$$

The matrix multiplication can then be carried out in N recursions, executing

$$
C^{(k)} = C^{(k-1)} + A_k B_k \tag{5}
$$

recursively for k = 1, 2, ..., N.
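
The decomposition into N recursions can be checked with a few lines of Python. This is a minimal sketch using plain nested lists; the function name `matmul_by_recursions` is an assumption introduced here.

```python
def matmul_by_recursions(a, b):
    """Compute C = A*B as N rank-one updates C(k) = C(k-1) + A_k B_k.

    a and b are N x N matrices given as lists of rows. A_k is the k-th
    column of A and B_k the k-th row of B.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]             # C(0) = 0
    for k in range(n):                          # recursion k = 1, ..., N
        col_k = [a[i][k] for i in range(n)]     # column A_k
        row_k = b[k]                            # row B_k
        for i in range(n):
            for j in range(n):
                c[i][j] += col_k[i] * row_k[j]  # outer-product update
    return c
```

Each pass of the outer loop touches every entry of C exactly once, which is why one recursion maps onto one sweep across the whole NxN processor array.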

The  topology  of  the  matrix  multiplication  algorithm  can  be  mapped  naturally  onto  the 
square,  orthogonal  NxN  matrix  array  of  the  WAP  (cf.  Fig.  7).  To  create  a  smooth  data 
movement  in  a  localized  communication  network,  we  make  use  of  the  computational  wavefront 
concept.  For  the  purpose  of  this  example,  a  wavefront  in  the  processing  array  will 
correspond to a mathematical recursion in the algorithm. Successive pipelining of the
wavefronts will accomplish the computation of all recursions.

As  an  example,  the  computational  wavefront  for  the  first  recursion  in  matrix 
multiplication  will  now  be  examined.  Suppose  that  the  registers  of  all  the  processing 
elements  (PEs)  are  initially  set  to  zero: 

$$
C^{(0)}_{ij} = 0 \quad \text{for all } (i,j). \tag{6}
$$

The  entries  of  A  are  stored  in  the  memory  modules  to  the  left  (in  columns),  and  those  of  B 
in  the  memory  modules  on  the  top  (in  rows).     The  process  starts  with  PE  (1,1): 

$$
C^{(1)}_{1,1} = C^{(0)}_{1,1} + a_{1,1} b_{1,1} \tag{7}
$$

is computed. The computational activity then propagates to the neighboring PEs (1,2) and
(2,1), which will execute:

$$
C^{(1)}_{1,2} = C^{(0)}_{1,2} + a_{1,1} b_{1,2} \tag{8}
$$

and

$$
C^{(1)}_{2,1} = C^{(0)}_{2,1} + a_{2,1} b_{1,1} \tag{9}
$$

The next front of activity will be at PEs (3,1), (2,2) and (1,3), thus creating a
computation wavefront traveling down the processor array. It may be noted that
wave propagation implies localized data flow. Once the wavefront sweeps through all the
cells, the first recursion is over. As the first wave propagates, we can execute an
identical second recursion in parallel by pipelining a second wavefront immediately after
the first one. For example, the (1,1) processor will execute

$$
C^{(2)}_{1,1} = C^{(1)}_{1,1} + a_{1,2} b_{2,1} \tag{10}
$$

$$
C^{(k)}_{ij} = a_{i1} b_{1j} + a_{i2} b_{2j} + \cdots + a_{ik} b_{kj} \tag{11}
$$

and so on.
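
The pipelined wavefronts can be tabulated with a short Python sketch. It assumes an idealized schedule, introduced here for illustration, in which wavefront k is launched at step k and advances by one anti-diagonal per step; the function names are likewise hypothetical. The second function checks that no PE is claimed by two recursions in the same step.

```python
def wavefront_schedule(n):
    """Map each time step to the (i, j, k) operations active in it.

    Assumes wavefront k is launched at step k and advances one
    anti-diagonal per step, so PE (i, j) runs recursion k at step
    t = i + j + k - 2 (all indices start at 1).
    """
    schedule = {}
    for k in range(1, n + 1):
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                t = i + j + k - 2
                schedule.setdefault(t, []).append((i, j, k))
    return schedule


def wavefronts_disjoint(schedule):
    """True if no PE is claimed by two recursions in the same step."""
    for ops in schedule.values():
        pes = [(i, j) for i, j, _ in ops]
        if len(pes) != len(set(pes)):
            return False
    return True
```

Under this assumed schedule, step 2 has PEs (1,2) and (2,1) working on the first recursion while PE (1,1) has already started the second, and the last operation of an NxN product falls in step 3N-2.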

The pipelining is feasible because the wavefronts of two successive recursions will
never intersect (Huygens' wavefront principle), as the processors executing the recursions
at any given instant will be different, thus avoiding any contention problems.


3.2     Asynchronous (Self-Timing) Wavefront Propagation


As  is  evident,  the  pipelining  of  the  wavefronts  can  be  easily  implemented  in  a 
highly  synchronous  fashion,  using  global  clock  distribution  as  discussed  above.  The  result 
of  such  an  implementation  is  referred  to  as  the  synchronous  wavefront  array  processor,  which 
is,  in  fact,  a  variant  of  the  systolic  array.  However,  a  much  more  intriguing  aspect  of  the 
wavefront  notion  lies  in  the  fact  that  it  is  amenable  to  asynchronous  computation.  This  is 
explained  below. 

To accomplish a self-timed, asynchronous computation, the processors in the array
must wait for a primary wavefront (of data), then perform the computation it calls for
and, finally, act as a secondary source of new wavefronts. For example, operations (8)
and (9) will not be executed until PEs (1,2) and (2,1) confirm inputting of (a11, b12) and
(a21, b11), respectively. By the same token, in the next front of the wave, cells (3,1),
(2,2) and (1,3) will be involved. PE (2,2), for example, has to wait until PEs (1,2) and
(2,1) flow their data, b12 and a21, respectively. Only after the arrival of that data will
the (2,2) cell execute its own operation, C(1)22 = C(0)22 + a21*b12, and activate its own
successors, PEs (2,3) and (3,2).

The self-timed processor array essentially simulates a phenomenon of wavefront
propagation. In particular, it exhibits the self-generating mechanism of the wavefront
property. From a hardware point of view, in order to implement this wait, processors are
provided with data transfer buffers. Hence, a FETCHing of data involves an inherent WAITing
for the buffer to be filled (DATA SENT) by the adjacent, data sourcing, processor. Thus,
the processing will not be initiated until the arrival of the data wavefront (this is
similar to the concept of data flow machines [10-18]). Each processor can FLOW data to the
input buffers of the neighboring PEs, thus acting as a secondary source of data wavefronts
(Huygens' principle). To avoid the overrunning of data wavefronts (in conformity with
Huygens' principle), the processor hardware ensures that a processor cannot send new data to
the buffer unless the old data has been used by the neighbor. Thus, the wavefront concept
suggests that interprocessor communication employ buffers and "DATA SENT/DATA USED" flags
between adjacent processors.

In short, the handshaking (wait) for wavefronts of data allows for globally
asynchronous operation of processors, i.e. there is no need for global synchronization.
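
The buffer-and-flag discipline can be sketched as a small Python class. This is a software illustration under stated assumptions, not the hardware described in the text: the class name `HandshakeBuffer`, its method names and the use of return values to signal "must wait" are all introduced here.

```python
class HandshakeBuffer:
    """One-word buffer between two adjacent PEs (hypothetical class).

    data_sent is raised by the sender when the buffer holds fresh data;
    data_used is raised by the receiver once that data has been fetched.
    """

    def __init__(self):
        self.word = None
        self.data_sent = False
        self.data_used = True     # an empty buffer may be written

    def flow(self, word):
        """Sender side: returns False (must wait) until old data is used."""
        if not self.data_used:
            return False
        self.word = word
        self.data_sent, self.data_used = True, False
        return True

    def fetch(self):
        """Receiver side: returns None (must wait) until data has arrived."""
        if not self.data_sent:
            return None
        self.data_sent, self.data_used = False, True
        return self.word
```

A sender that calls `flow` twice in a row is refused the second time until the receiver has called `fetch`, which is the software analogue of preventing one data wavefront from overrunning another.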


3.3     Asynchronous Wavefront Array Processor Configuration


To implement the asynchronous computation, special attention is given to the PE's
interfacing with its neighbors. Every interior element has a uni-directional buffer and
independent status and control flags to each of its four adjacent elements. These buffers
can be supported by an appropriate multiplexing subsystem, under the control unit's
supervision.

The concept of the asynchronous approach, as implemented in the WAP architecture,
is described above and in [8,9]. This scheme involves a handshaking protocol between the
adjacent processing elements. The protocol ensures the regularity and the continuity of the
flow of information through the processor array. This calls for an additional number of
input and output signal lines for each processor element. The global clock of the
synchronous scheme is now replaced by the Data Sent and the Data Used lines, which establish
an exchange between the adjacent processors with regard to data transmission timing. These
signals form a part of the self-timed systems [5], each processor signalling to its neighbor
whenever it is ready to take an action.


3.4     Timing  Analysis 


As the PE itself may be internally synchronous, with an internal clock period,
T(pe), which is not affected by factors outside the processor's bounds, we shall assume such
a model. Note also that, in contrast with the synchronous configuration, the asynchronous
PE will have separate flow and fetch tasks. Therefore, the execution time per recursion,
for the same matrix multiplication algorithm applied to the synchronous model, would now be
[K{mult}+K{add}+2*K{flow}+2*K{fetch}]*T(pe).

The transfer of data, in the basic model, from PE(i) to PE(j) calls for PE(i) to
apply the appropriate data to the interprocessor data bus, and a pulse generated on the
Data-Sent control line (see Fig. 8). The width of that pulse, t(1), must be greater than
the data setup time of the Data Input Buffer of PE(j). After the pulse has been generated,
PE(i) can turn to the next task of the recursion. It is of interest to note that, although
the Data Input Buffer and the Data Transfer Control flip flop, described below, are
physically within the domain of PE(j), from a timing point of view they must conform to the
timing constraints of PE(i).


Fig. 7: Configuration for NxN WAP (showing the memory modules and the first wavefront)

Fig. 8: Interprocessor Handshaking Scheme


The negative edge of the Data-Sent pulse enters the data into the Data Input Buffer
of PE(j), and also toggles the Data Transfer Control flip flop, thereby notifying PE(j) of
the availability of data. In general, PE(j) will be waiting for that data, and will
immediately execute a FETCH instruction. The time lost in this transaction is, in the worst
case, T(pe(j))+d(ff), where d(ff) is the input to output delay involved in toggling the flip
flop. Upon completion of the data FETCH, PE(j) issues a Data-Processed pulse of duration
t(2), where t(2) must be larger than the clock pulse width required by the flip flop. The
flip flop transition, in the form of the Data-Used signal, is then propagated back to PE(i).
This involves a time delay of d(ff)+d(p), where d(p) is the propagation delay of the
interprocessor flagging signal.

In none of the application algorithms implemented to date has there been a
necessity to have two consecutive FLOW tasks in the same direction. Thus, after a FLOW has
been executed, PE(i) is implementing its next task concurrently with the FETCH executed by
PE(j). By the time PE(i) has to carry out the next FLOW to PE(j), the Data-Used signal will
have already been set, and there is, therefore, no waiting involved. The timing penalty for
this situation is, at worst, t(2)+d(ff). This penalty is paid only once, and does not
multiply by the number of recursions, nor by the number of FLOW (FETCH) tasks, provided they
are not consecutive. To this delay, one must add the PE time added for separate FLOW/FETCH
instructions, which is f*T(pe). Evaluation of the efficiency factor for the asynchronous
model is highly dependent on both the architecture of the PE and on the algorithm that is
being implemented. Based on some reasonable assumptions relating to the execution time ratio
between multiplication and the other operations involved in the bench test matrix
multiplication algorithm, EFF is calculated to be of the order of 72%. As has been
mentioned, a higher computation intensity within the PE will improve the efficiency factor
in this model.
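
The effect of the separate FETCH tasks on the efficiency factor can be sketched as follows. The sketch reads f as the 2*K{fetch} extra cycles per recursion and ignores the one-time handshake penalty t(2)+d(ff); both are interpretations made here, and the cycle counts used are hypothetical, since the text does not list the values behind its 72% figure.

```python
def async_efficiency(k_mult, k_add, k_flow, k_fetch):
    """EFF for the asynchronous PE, ignoring the one-time handshake penalty.

    Numerator: cycles per recursion of an independent PE with combined
    flow/fetch tasks. Denominator adds the f = 2*K{fetch} cycles needed
    for the separate FETCH tasks. T(pe) cancels out of the ratio.
    """
    base = k_mult + k_add + 2 * k_flow
    f = 2 * k_fetch
    return base / (base + f)


# Hypothetical cycle counts; the text does not list the values it used.
eff = async_efficiency(k_mult=8, k_add=2, k_flow=1, k_fetch=1)
```

Raising `k_mult` while holding the other counts fixed pushes the ratio toward 1, which matches the remark that a higher computation intensity within the PE improves the efficiency factor. Unlike the synchronous case, nothing in this ratio depends on N.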


4.   Comparison     of  Synchronous  and  Asynchronous  Configurations 


There  are  numerous  criteria  in  evaluating  the  timing  schemes  of  the  processing 
array.  The  performance  tradeoffs  between  the  two  general  approaches  are  highly  dependent  on 
the  internal  architectural  features  of  the  individual  PE,  such  as  pipelining  and 
concurrency,  the  number  of  PEs  required  in  the  array,  the  processing  technology  employed,  as 
well  as  the  requirements  and  nature  of  the  algorithms  implemented  by  the  array.  Timing 
analysis  can  be  applied  to  any  given  system  configuration,  resulting  in  curves  of  the  type 
provided in Fig. 6. The crossover points between the attractiveness of an asynchronous
scheme and that of a synchronous design can then be found from these curves. Since the
O(N**3) relationship between clock skew and number of PEs is not technology dependent, it
seems evident that, for large array sizes, the asynchronous scheme, provided by the WAP,
will prevail.


There are also other considerations which may be crucial in the favoring of one
configuration over the other. The globally synchronous array will severely limit the
programmability and somewhat hamper the system extendability. This lack of flexibility will
diminish its cost-effectiveness. In this respect, the powerful notion of computational
wavefront propagation, together with the asynchronous timing capability, enables us to deal
with the issues of programmability and extendability quite readily. On the other hand,
synchronous arrays may provide more facile chip testability, due to their regular timing
features.

The  above  discussion  indicates  the  need  for  establishing  a  careful  tradeoff  between 
the  two  timing  schemes.  It  also  opens  the  door  to  the  concept  of  GALS  (Globally 
Asynchronous  and  Locally  Synchronous)  configurations,  which  is  a  logical  compromising 
outcome  of  the  previous  analysis.  The  GALS  approach,  which  will  incorporate  blocks  of 
synchronous  PEs  in  a  globally  asynchronous  system,  will  allow  a  merging  of  the  merits  of 
both  timing  schemes,  establishing  a  justifiable  tradeoff  between  the  two.  However,  the 
outcome of the analysis of GALS is not yet clear at this point.


5.     Timing  Aspects  of  the  WAP 


As the WAP has a globally asynchronous configuration, timing problems arise only as
a result of the processor-to-processor asynchronous communication scheme. The major
considerations, from a system-level point of view, include inhibiting deadlock and racing
situations within the array. The elimination of these hazardous conditions is carried out
by means of both hardware features and software syntax rules and constraints. We will deal,
here, only with racing problems, as they are caused solely by the self-timing feature.

In order to retain the wavefront oriented propagation of computation, each front of
the wave must be tied to its preceding and succeeding fronts in such a manner as to
eliminate the possibility of a processor's "running away" without transferring the required
data and activity parameters to its neighbors. To clarify this issue, assume that PE1 is
the left-hand predecessor of PE2, and that the current algorithm requires r recursions, each
of which includes the instruction sequence of Fig. 9.

If  PE1  is  not  tied  to  PE2  by  some  restrictive  rules,  it  can,  conceivably,  complete 
its  recursion  in  much  less  time  than  PE2.  Thus,  PE1  may  be  flowing  data  resulting  from  its 
third  recursion  when  PE2  has  just  completed  its  first  recursion  and  is  expecting  second 
recursion data from PE1.

The  ensuing  conclusion  is  that  PE1  and  PE2  must  be  chained  together.  Originally, 
this  chaining  was  carried  out  by  the  concept  of  activity  movement.  Once  PE1  finished  the 
tasks  associated  with  one  recursion,  it  propagated  activity  via  an  Activate  signal  to  PE2. 
The concept of transfer of activity does not, however, resolve all of the communication
problems within the array. One outstanding example occurs when PE1 transmits data to
PE2 twice within a recursion. If PE2 is slow and has not yet input the first data word
transmitted, that word will be overwritten by PE1 before PE2 has accessed it, and is
therefore lost. A possible solution to this dilemma would be to create a FIFO input
buffer to accumulate incoming data until it has been digested by the receiving processor.
This imposes further system constraints, such as defining the depth of the buffer, which, in
turn, will call for additional program bookkeeping and resource allocation management.


    REPEAT
        WHILE WAVEFRONT IN ARRAY DO
        BEGIN
            FETCH <From Left>;
            FLOW  <To Right>;
        END;
    UNTIL TERMINATED;

Fig. 9: Instruction Sequence Relating to One Recursion.


    FLOW (PE1):   WAIT UNTIL Data Used
                  IF Data Used THEN:
                      TRANSFER DATA TO THE DATA BUS
                      SET Data Sent TO "1"
                      RESET Data Used TO "0"

    FETCH (PE2):  WAIT UNTIL Data Ready
                  IF Data Ready THEN:
                      INPUT DATA FROM THE INPUT BUFFER
                      SET Data Used TO "1"

Fig. 10: Data Transfer Sequence


As the source of the "run-away" problem appears to be the need for a "wait" state
which will hold PE1 from proceeding in its tasks without waiting for PE2, it seems natural
to tie the "wait" state concept to the data transfer itself, rather than to the wavefront
propagation phenomenon, the integrity of which it is helping to retain. It is therefore
suggested that the Activate/Active signal pair be replaced by data transfer handshaking
signals. Each processor will have four input buffer registers, one from each
direction, and two pairs of handshaking signals (one pair for each direction of data
transfer) to each of its four neighbors to provide the communicating relationship necessary.
The two signals of each pair are the Data Sent (D.S.) and Data Used (D.U.) signals.
The sequence of events invoked by a data transfer is given in Fig. 10. Data Ready is a
derivative of Data Sent.

This  sequence  ensures  that  a  transmitting  PE  will  never  overwrite  previously  sent 
data  before  it  has  been  used,  and  that  a  receiving  PE  will  always  be  reading  new,  unused 
information.  Thus,  the  data  transfer  mechanism  "chains"  the  two  adjacent  PEs  together  in  a 
timewise  loose,  yet  sequentially  determined  manner. 
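The handshaking sequence of Fig. 10 can be illustrated with a small software model. The following is only a sketch under stated assumptions: the names `Link` and `simulate` are hypothetical, the two PEs are interleaved by a pseudo-random scheduler rather than running on real hardware, and Data Ready is assumed to be derived as "Data Sent and not Data Used", which the text does not spell out.

```python
import random


class Link:
    """One-word input buffer between a sending PE and a receiving PE."""

    def __init__(self):
        self.buffer = None
        self.data_sent = False
        self.data_used = True  # buffer starts empty, so the sender may write

    @property
    def data_ready(self):
        # Assumption: Data Ready is derived as Data Sent AND NOT Data Used.
        return self.data_sent and not self.data_used

    def flow(self, word):
        """FLOW (sender side). Returns False if the sender must keep waiting."""
        if not self.data_used:
            return False
        self.buffer = word      # transfer data to the data bus
        self.data_sent = True   # set Data Sent to 1
        self.data_used = False  # reset Data Used to 0
        return True

    def fetch(self):
        """FETCH (receiver side). Returns (ok, word); ok is False while waiting."""
        if not self.data_ready:
            return False, None
        word = self.buffer      # input data from the input buffer
        self.data_used = True   # set Data Used to 1
        return True, word


def simulate(words, seed=0, sender_share=0.8):
    """Interleave a fast sender and a slow receiver; return the words received."""
    rng = random.Random(seed)
    link = Link()
    pending = list(words)
    received = []
    while len(received) < len(words):
        if pending and rng.random() < sender_share:
            if link.flow(pending[0]):  # sender only advances when allowed
                pending.pop(0)
        else:
            ok, word = link.fetch()
            if ok:
                received.append(word)
    return received
```

Even though the sender is scheduled far more often than the receiver in this model, every word arrives exactly once and in order, which is the "timewise loose, yet sequentially determined" chaining described above.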

Another  timing  problem  created  by  the  asynchronous  nature  of  WAP  communications 
involves  the  propagation  of  the  Terminated  flag.  Each  PE  has  its  independent  Terminated 
flag,  which  it  propagates  to  its  down-  and  right-hand  neighbors.  Each  processor  also 
receives  two  incoming  Terminated  signals  from  its  up-  and  left-hand  neighbors.  When  a 
processor  has  finished  the  tasks  called  for  within  one  recursion,  it  checks  for  the 
conditions  of  terminating  the  current  phase  of  instructions.  In  every  PE  except  for  the 
(1,1) cell, termination is dependent upon the termination of the PE's predecessors. Here,
too,  a  "runaway"  condition  is  possible.  The  solution  to  this  situation  is  quite  simple. 
Details  of  the  solution  are  postponed  to  a  later  paper. 


Conclusion 


We have presented a comparison between the synchronous and the asynchronous timing
analysis of a square, NxN array of processors. Whereas the asynchronous model incurs a
fixed time delay overhead due to the handshaking processes, the synchronous time delay is
due primarily to the clock skew, which changes dramatically with the size of the array, N.
More precisely, our analysis indicates that the clock skew grows with array size at a
significant rate of O(N**3). An immediate conclusion from this analysis is that, while for
small N a globally synchronized processor array may be easier to implement, for larger N a
self-timed (asynchronous) system may become much more favorable. However, there are (and
we have briefly looked into) some other important factors, such as programmability,
extendability, testing, racing, etc., which must be taken into consideration. In short, the
only definitive conclusion is that the ultimate decision has to hinge upon the final
hardware performance evaluation as well as the system application requirements.

APPENDIX A


Let L be the dimension of a PE. Then the array dimension will be N*L. Taking each
segment of the H-tree as a lumped R-C branch provides the equivalent circuit of Fig. A1.
Note that the innermost branch of the H-tree, which provides the final clock signal to the
PE, corresponds to the rightmost segment of the circuit, consisting of R0 and C0. Each "H"
of the H-tree is replaced by two levels of RC segments, as shown. As the arms of any H
structure are of equal length, the values of R and C representing both levels of the H
structure are equal. Also, as the lengths of the H arms double from one H-level to the
next, the values of both R and C at each H-level will be twice those of their successor
H-levels. The levels of the H-tree structure are denoted by q, where 1 <= q <= log2 N. The
values of R and C in the root of the distribution tree will, therefore, be N*R0 and N*C0,
respectively.

The clock distribution network is divided into paths of metal conductor and those of
diffusion (or polysilicon). The metal will have a capacitance/area of C(m) and a
resistance/square of R(m) associated with it, and the diffusion will have C(d) and R(d),
respectively. The clock path to each PE consists of r parts metal and (1-r) parts
diffusion. We assume that this distribution of the conducting material holds true for each
and every segment of the H-tree.

By the simple circuit equivalence shown in Fig. A2(a), the system of Fig. A1 can be
replaced by N**2 parallel branches, all of which are identical to that provided in Fig.
A2(b). Each serial branch represents the equivalent circuit, as seen by any individual PE,
between the clock input at the root of the H-tree and that PE.

As a first order approximation of the equivalent circuit, we lump all the
resistances and all the capacitances of Fig. A2(b) together. The sum of the resistances is:

$$ R_{eq} = \left\{ \frac{3}{8} \sum_{q=1}^{\log_2 N} 8^{q} + N^{3} \right\} R_0 = \frac{(10 N^{3} - 3)\, R_0}{7} \qquad (a.1) $$

and  of  the  capacitances  is: 

$$ C_{eq} = \left\{ 1 + 2 \sum_{q=1}^{\log_2 N} 2^{-q} \right\} C_0 = \left( 3 - \frac{2}{N} \right) C_0 \qquad (a.2) $$

The  equivalent  time  constant  of  the  distribution  network  is,  therefore, 

$$ \tau_{eq} = \frac{(10 N^{3} - 3)\left( 3 - \frac{2}{N} \right) R_0 C_0}{7} \qquad (a.3) $$


It should be noted that this time constant, and the clock pulse rise time and the
clock skew associated with it, are of O(N**3). This is corroborated by the simulation
results provided in the main text.

To further verify the results, a two-pole approximation of the distribution network,
as shown in Fig. A3, was executed. Here, C_eq(1) represents the parallel combination of all
the intermediate capacitances, implying that all the intermediate resistors have been short
circuited. Thus, C_eq(1) = 2*C0*(1 - 1/N). The resulting transfer function is then:

$$ T = \frac{1}{1 + (3 N^{3} - 2 N^{2} + 1)\, R_0 C_0\, s + 2 (N^{3} - N^{2}) (R_0 C_0\, s)^{2}} \qquad (a.4) $$

The negative reciprocals of the poles for this model, for N>2, are:

$$ \tau_1 = (3 N^{3} - 2 N^{2} + 1)\, R_0 C_0 \quad \text{and} \quad \tau_2 = \frac{2 R_0 C_0}{3} \qquad (a.5) $$

which are, of course, the time constants of the step input response. Again, the O(N**3)
dependency is evident.
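The closed forms of this appendix can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, N is assumed to be a power of two, and R0 and C0 are normalised to 1. It compares the sums in (a.1) and (a.2) with their closed forms, evaluates (a.3), and computes the exact negative reciprocals of the poles of (a.4) for comparison with the approximate values in (a.5).

```python
import math


def levels(n):
    """Number of H-tree levels, log2(N), for N a power of two."""
    return n.bit_length() - 1


def r_eq_sum(n, r0=1.0):
    """Left-hand side of (a.1): explicit sum over the H-tree levels."""
    return (3 / 8 * sum(8 ** q for q in range(1, levels(n) + 1)) + n ** 3) * r0


def r_eq_closed(n, r0=1.0):
    """Right-hand side of (a.1)."""
    return (10 * n ** 3 - 3) * r0 / 7


def c_eq_sum(n, c0=1.0):
    """Left-hand side of (a.2): explicit sum over the H-tree levels."""
    return (1 + 2 * sum(2.0 ** -q for q in range(1, levels(n) + 1))) * c0


def c_eq_closed(n, c0=1.0):
    """Right-hand side of (a.2)."""
    return (3 - 2 / n) * c0


def tau_eq(n, r0c0=1.0):
    """Lumped time constant of (a.3)."""
    return (10 * n ** 3 - 3) * (3 - 2 / n) * r0c0 / 7


def two_pole_time_constants(n, r0c0=1.0):
    """Exact negative reciprocals of the poles of (a.4): (slow, fast)."""
    a1 = (3 * n ** 3 - 2 * n ** 2 + 1) * r0c0
    a2 = 2 * (n ** 3 - n ** 2) * r0c0 ** 2
    disc = math.sqrt(a1 * a1 - 4 * a2)
    # Poles are s = (-a1 +/- disc) / (2*a2), so -1/s = 2*a2 / (a1 -/+ disc).
    return 2 * a2 / (a1 - disc), 2 * a2 / (a1 + disc)
```

Doubling N multiplies `tau_eq` by a factor close to 8, which is the cubic growth referred to in the text; the slow time constant returned by `two_pole_time_constants` approaches the first value in (a.5) and the fast one approaches 2/3 (in units of R0*C0) as N grows.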
Abstract

A novel multi-bit convolver/correlator circuit is described. The circuit has been
designed to operate as a systolic array of simple one-bit processor and memory cells and,
as a result, it can operate at relatively high data rates by making efficient use of silicon
area. Since the design is extremely regular in nature and requires very little control, it
should be easy to implement in VLSI technology. The size of circuit which can be
fabricated and the data rate which can be achieved will of course depend on the specific
technology which is chosen.

1.  Introduction 


The volume of literature devoted to systolic arrays has expanded rapidly since Kung and
Leiserson first demonstrated how they could be used to carry out the pipelined computation
of various operations [1,2] and the concept has now been successfully applied to a wide range
of problems [3-6]. One subject which is likely to benefit considerably from these techniques
is that of digital signal processing [7-10], the main reason being that many of the
computations which arise involve highly repetitive arithmetic operations and very regular data
flow. To date most of the effort in applying systolic arrays to digital signal processing
has been concentrated at the word or system level and the relevant circuits tend to consist
of arrays of multiplier/accumulator type processors. However, many of the desirable
properties which result from the systolic array approach can also be exploited to advantage
in the design of individual chips by considering problems at the bit rather than the word
level. In recent papers [11,12] we have shown how a number of important digital signal
processing functions could be implemented using arrays of simple bit level cells organised
and clocked in a systolic fashion. The typical bit-level cell comprises a full adder, some
simple logic and a number of latches, the function of the circuit being determined by the
cell interconnection pattern. The general idea was illustrated by means of two rather
simple examples: a pipelined multiplier and a pipelined circuit for computing a one-bit
transform slice. In this paper we take the ideas a stage further and describe a systolic
array of bit level cells which could be implemented on a single VLSI chip and is capable of
carrying out (in a bit serial manner) a number of higher level multi-bit functions including
convolution, correlation and matrix vector transforms. For the purposes of explanation we
focus most of our attention on the case of convolution as this serves to illustrate the
major features of our circuit.

In sections 2 and 3 the operation of the circuit is described in abstract terms with
reference to a geometric timing/data flow diagram. Section 2 deals with the special case
in which the data and coefficient words are assumed to be positive and this is generalized
to two's complement values in section 3. The circuit itself is then described in section 4
and the way in which it can be generalized to carry out correlation and matrix vector
transforms is outlined in section 5.

2.     The  basic  convolution  algorithm 


The jth output of an N point convolution process may be written in the form

$$ y_j = \sum_{i=0}^{N-1} a_i x_{j-i}, \qquad j = 0, 1, 2, \ldots \qquad (2.1) $$

where {a_0, a_1, ..., a_{N-1}} represents the set of fixed coefficients
and x_i (i = 0, 1, ...) represents a sequence of input data (signal) values.

It is assumed throughout this discussion that the coefficient and signal values take the
form of n-bit binary numbers. Pipelined computation of this function can be carried out
using an array of cells which take the form shown in figure 1 and are illustrated
schematically by means of squares in figure 2. Each cell has as inputs a data bit, x,
which enters from the west, a sum bit, s', which enters from the north, a coefficient bit,
a, which enters from the east and a carry bit, c', which also enters from the east. These
bits are stored in latches (represented by small squares) and on each pulse of a system
clock they are made available to the cell logic, which performs the gated full adder
function

$$ s \leftarrow s' \oplus (a \cdot x) \oplus c' $$

$$ c \leftarrow (a \cdot x) \cdot s' + (a \cdot x) \cdot c' + s' \cdot c' \qquad (2.2) $$

At  the  end  of  each  cycle  the  resulting  sum  and  carry  bits,   s  and  c,   are  made  available  at 
the  output  of  the  cell  together  with  the  original  coefficient  and  data  bits,   a  and  x. 
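The cell logic of equation (2.2) can be written down directly. The following is a minimal sketch, assuming each input is a single bit held as a Python integer 0 or 1; the function name `gated_full_adder` is hypothetical and latching and clocking are not modelled.

```python
def gated_full_adder(a, x, s_in, c_in):
    """Combinational logic of one cell, following equation (2.2).

    a, x      : coefficient and data bits (0 or 1)
    s_in, c_in: incoming sum and carry bits s' and c' (0 or 1)
    Returns the outgoing sum and carry bits (s, c).
    """
    product = a & x                      # gating: the partial product bit a.x
    s = s_in ^ product ^ c_in            # sum output
    c = (product & s_in) | (product & c_in) | (s_in & c_in)  # carry output
    return s, c
```

For every combination of inputs the outputs satisfy s + 2c = (a AND x) + s' + c', which is the defining property of a full adder whose third input is the gated product bit.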


Figure  1.     The  basic  cell  required  for  the  bit  level 
convolution/correlator  systolic  array 


The operation of the proposed convolution circuit is best explained by means of the
abstract timing diagram which is included in figure 2. It provides a simple geometric
picture to illustrate the underlying structure of the data and coefficient flow within the
array. Figure 2 refers to the specific case (chosen for ease of explanation) in which the
coefficients, a_i, and data values, x_{j-i}, take the form of three bit positive numbers (i.e.
n = 3) and the convolution is a ten point process (i.e. N = 10) which produces 6 bit output
values.

The data words, x_{j-i}, and coefficient words, a_i, enter the array in a bit serial fashion
with zeros interspersed between their adjacent bits. In the case of the data the least
significant bit of x_{j-i} enters ahead of the next least significant bit, and so on,
whilst for the coefficient words the reverse is true and so the most significant bit of a_i
is input first followed by bits of lower significance. The input sequence for each
coefficient and data word is staggered relative to the corresponding word on the row above it
by one clock cycle. This means that the bits of the data words, x_{j-i}, can be regarded as
forming a set of rightward leaning parallelograms which move from left to right across the
array progressing by one cell every clock cycle. Similarly, the bits of the coefficient
words, a_i, form a set of leftward leaning parallelograms which move at the same rate in the
opposite direction.

It is important to note that the most significant and least significant bits of
successive words are not separated by zeros. This means that all bits within a given
parallelogram occur on either odd clock cycles (indicated by the longitudinal shading) or
even clock cycles (indicated by the transverse shading) and this will be referred to as the
phase of the parallelogram. Clearly each data or coefficient parallelogram takes the
opposite phase to its predecessor. The reason for these opposite phases will become
apparent below.
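The input format described above, and the alternation of phase that follows from it, can be reproduced in a few lines. This is a sketch only: the names `serialise` and `bit_cycles` are hypothetical, a single row of the array is considered, and the first bit of the first word is assumed to enter on clock cycle 0.

```python
def serialise(words, n, msb_first=False):
    """Bit serial stream for one row: zeros between adjacent bits of a word,
    but no zero between the last bit of one word and the first of the next.

    Data words are sent least significant bit first (msb_first=False);
    coefficient words are sent most significant bit first (msb_first=True).
    """
    stream = []
    for word in words:
        bits = [(word >> k) & 1 for k in range(n)]  # bit k has weight 2**k
        if msb_first:
            bits.reverse()
        for position, bit in enumerate(bits):
            stream.append(bit)
            if position < n - 1:
                stream.append(0)  # interspersed zero
    return stream


def bit_cycles(num_words, n):
    """Clock cycles on which the bits of each successive word enter the row."""
    period = 2 * n - 1  # each word occupies 2n-1 cycles
    return [[w * period + 2 * k for k in range(n)] for w in range(num_words)]
```

Because each word occupies 2n-1 cycles, an odd number, the bits of successive words fall alternately on even and odd cycles. For n = 3, `bit_cycles(3, 3)` gives `[[0, 2, 4], [5, 7, 9], [10, 12, 14]]`.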

Figure 2. Timing/data movement diagram describing the operation of the proposed
circuit for positive data and coefficient words


As successive parallelograms move across the array, data and coefficient bits of equal
phase interact within diamond shaped interaction regions, which appear cross hatched in
figure 2 and which move progressively down through the array of cells. Each location within
an interaction region is associated with the formation of a single partial product of the
form a_i^k x_{j-i}^l (k, l = 1, 2 ... n-1). As each diamond moves down through the array,
successive cells which appear at that location within it accumulate all the terms in the
corresponding partial product sum,

$$ y_j^{k,l} = \sum_{i=0}^{N-1} a_i^{k} x_{j-i}^{l} $$

Successive interaction regions evaluate the partial product sums, y_j^{k,l}, associated with
successive convolution outputs, y_j, y_{j+1} ... and figure 2 illustrates the regions associated
with y_9 and y_10. It should be noted that non-zero partial product terms are only formed on
cells within the interaction regions between parallelograms which have the same phase (the
cross-hatched regions). In areas outside this (the triangular regions to the right and
left of the diamonds) the parallelograms are out of phase and so the corresponding cells
must produce a zero product bit. As a result, they cannot produce any undesirable carry
terms which could otherwise propagate into the important interaction regions.
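The decomposition of a convolution output into partial product sums can be verified arithmetically, independently of the array timing. The sketch below makes several assumptions that are not taken from the text: bits are indexed from 0 to n-1 with bit k carrying weight 2^k, data values outside the supplied sequence are treated as zero, and the function names are hypothetical.

```python
def bit(word, k):
    """Bit k (weight 2**k) of a non-negative integer."""
    return (word >> k) & 1


def convolve_word_level(a, x, j):
    """y_j of equation (2.1), with x treated as zero outside its range."""
    return sum(a[i] * x[j - i] for i in range(len(a)) if 0 <= j - i < len(x))


def partial_product_sum(a, x, j, k, l):
    """Sum over i of (bit k of a_i) * (bit l of x_{j-i})."""
    return sum(
        bit(a[i], k) * bit(x[j - i], l)
        for i in range(len(a))
        if 0 <= j - i < len(x)
    )


def convolve_bit_level(a, x, j, n):
    """Recombine the partial product sums with their binary weights."""
    return sum(
        partial_product_sum(a, x, j, k, l) << (k + l)
        for k in range(n)
        for l in range(n)
    )
```

For n-bit positive inputs the two routes give the same result for every j, which is the identity that the final vertical accumulation in the array relies on.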

In order to complete the convolution operation the partial product sums which are formed
within each interaction region must be accumulated vertically as the diamond emerges from
the array. This final accumulation process can be accomplished by adding an extra row of
full adder cells to the bottom of the circuit and feeding the output sum values from these
cells back to their own inputs. From the structure of the timing diagram in figure 2 it
can be seen that the carries generated at any stage in the process of evaluating a result
y_j have sufficient time to propagate to the most significant position, if necessary,
without encroaching on the interaction region of the subsequent result y_{j+1}. Each bit of y_j
must of course be read from the output of the appropriate bottom cell as soon as it has been
completely evaluated so that the feedback to that cell can be cleared in time for the final
accumulation of y_{j+1} to begin on the next clock cycle. A complete result is obtained every
2n-1 clock cycles.

So  far  within  this  discussion  no  consideration  has  been  given  to  the  fact  that  the  word 
length  required  to  represent  the  accumulating  partial  products  will  generally  tend  to  grow 
as  the  interaction  region  moves  down  through  the  array.     However,   this  effect  can  easily  be 
accommodated  within  the  type  of  structure  shown  in  figure  2  by  adding  to  the  left  hand  side 
of  each  row  in  the  array  a  sufficient  number  of  separate  half  adder  cells  to  accommodate 
the  maximum  word  length  possible  at  that  stage  and  so  prevent  overflow.     Alternatively,  the 
input  word  lengths  (and  hence  the  number  of  cells  in  every  row  of  the  array)  could  be 
extended  from  the  outset  to  cover  the  range  of  the  maximum  possible  result. 

3.     Application  to  two's  complement  numbers 

The type of circuit described in section 2 can also be used to perform the convolution
operation in the two's complement number representation, provided that the data and
coefficient values are sign extended to the range of their maximum possible product before they
enter the array. In general, therefore, an n-bit two's complement data or coefficient word
would be sign extended to M = 2n bits. The entire calculation can then be carried out as
if the input data and coefficient words were M bit positive numbers. The only difference
is that all bits of significance 2^M or greater can be ignored as they are not required
within the calculation and so the corresponding cells can be omitted from the array.
Figure 3 contains the circuit schematic and corresponding timing diagram which is obtained
for the particular case (chosen for ease of illustration) in which M = 3, N = 10 and the
result can be expressed in three bit two's complement form. The triangular shaped
interaction region which occurs in this example is simply the least significant section of the
diamond shaped interaction region in figure 2.
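The arithmetic behind this sign extension argument can be checked with ordinary integers. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the word length `bits` is chosen by the caller to be large enough to hold the final result, and data outside the supplied sequence is treated as zero. It treats sign extended words as positive numbers and simply discards every bit of significance 2^bits or greater.

```python
def to_unsigned(value, bits):
    """Sign extend a signed integer to `bits` bits and read it as positive."""
    return value & ((1 << bits) - 1)


def to_signed(word, bits):
    """Interpret the low `bits` bits of a word as a two's complement value."""
    word &= (1 << bits) - 1
    return word - (1 << bits) if word & (1 << (bits - 1)) else word


def twos_complement_convolve(coeffs, data, j, bits):
    """Output j of the convolution, computed as if all words were positive."""
    mask = (1 << bits) - 1
    acc = 0
    for i, a in enumerate(coeffs):
        if 0 <= j - i < len(data):
            product = to_unsigned(a, bits) * to_unsigned(data[j - i], bits)
            acc = (acc + product) & mask  # bits of weight >= 2**bits are dropped
    return to_signed(acc, bits)
```

For example, with coefficients `[-2, 1]`, data `[3, -4, 1]` and `bits = 8`, the outputs for j = 0, 1, 2 are -6, 11 and -6, matching the signed calculation.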

As  before,   the  partial  product  sums  which  are  formed  within  the  interaction  region  must 
be  accumulated  vertically  as  the  interaction  region  emerges  from  the  bottom  of  the  array. 
In  this  case,   however,    the  triangular  shape  ensures  that  any  carry  bits  which  are  generated 
in  the  final  accumulation  process  have  time  to  propagate  fully  before  any  bits  within  the 
next  interaction  region  enter  the  bottom  row  of  cells.     The  results  could  therefore  be  read 
out  in  parallel  (once  every  2M-1  clock  cycles)  if  desired. 

With  the  type  of  circuit  described  in  figure  3  it  will  generally  be  necessary,   as  before, 
to  allow  the  word  length  of  the  accumulating  partial  products  to  grow  as  the  interaction 
region  moves  down  through  the  array.     Once  again,   this  can  be  accomplished  by  adding  a 
number  of  extra  cells  to  the  left  hand  side  of  each  row  in  the  array.     In  this  case  it  is 
essential  to  ensure  that  the  partial  product  sign  bits  are  extended  into  these  cells  and 
they  must  therefore  be  capable  of  performing  the  full  adder  function. 

4.      Implementation  of  the  two's  complement  circuit 

At first sight it might appear that the circuit which is illustrated schematically in
figure 3 requires two sets of parallel input ports, one for coefficient bits and the other
for the data bits. However, a close look at the timing diagram will show that this is not
the case. In effect, the bits of each coefficient word, a_i, move cyclically through the M
cells within the ith row, each being required as an input to the right hand cell M-1 clock
cycles after it has emerged from the left hand cell. This effect could be achieved in
practice by having an extra latch associated with each cell and clocking the coefficient
bits back through M-1 of them, suitably interspersed between the rows of the array. This
arrangement would, of course, preserve the nearest neighbour communication property within
the array. It can also be seen from the timing diagram that each bit of a data word x_i is
required as an input to the left hand cell M-2 clock cycles after it has emerged from the
right hand cell of the row beneath. This effect could be achieved by clocking the data bits
back through M-2 latches and the net effect is that data words x_i wind their way
progressively up through the array in a bit serial fashion. It follows that the data words could
be input to the circuit using a single pin. In practice, it would be advisable (but not
essential) to provide N parallel input ports through which the coefficient words can enter
the circuit. Assuming that the input facility is disabled at the appropriate time the
words, a_i, would then recirculate until it is necessary to update them. This flexibility
would be desirable if the proposed circuit were to be used as an adaptive filter and
provides the facility to carry out the type of matrix × vector product discussed in section
5.

Figure 3. Timing/data movement diagram describing the operation of the proposed
circuit for two's complement data and coefficient words

In  most  signal  processing  applications,   it  is  sufficient  to  provide  a  result  which  has 
been  suitably  truncated  or  rounded  and  so  it  may  be  possible  to  remove  a  number  of  columns 
of  processing  cells  from  the  right  hand  side  of  the  array  in  figure  3  without  affecting  the 
final  answer.     However,    it  must  be  remembered  that,    in  general,   several  partial  products 
from  each  multiplication  will  be  truncated  when  a  column  is  removed  in  this  way.     It  is 
also  important  to  note  that  the  timing  relationships  which  exist  in  the  original  circuit 
must  be  maintained  and  so  the  latches  associated  with  every  cell  must  be  retained  even  if 
the processing element is removed. If a rounded result is required, then a correction term
which represents the mean of all discarded bits must be added to every result which emerges
from the circuit. This could easily be accomplished by initialising the result to the
appropriate value on the bottom row of cells.

Figure 4 shows the envisaged layout of a five point convolver designed for a particular
situation in which the data and coefficient words are sign extended to four bit two's
complement form (the range of the exact result) and the least significant column of cells
has been truncated from the circuit. In this case the number of feedback delays required
for the data and coefficient bits is M' and M' + 1 respectively, where M' = M - 1 = 3 is the
reduced number of cells in each row of the array, and so the latches in question have
been distributed such that every cell in the circuit (cell (a) in figure 4) now contains
two latches more than the basic cell shown in figure 1. One of these is required for
clocking back the coefficient bits, and the other for clocking back the data bits. Each row
in the array requires one extra latch to delay the coefficient bits but this can be added to
the right hand side of the circuit without affecting the regularity of the array. Figure 4
also shows the row of adder cells (cell type (b)) required to carry out the final
accumulation process at the bottom of the array.


Figure 4. Details of the proposed circuit. The example shows five stages of
a four bit correlator circuit which has been truncated to 3 columns.
x+ and a+ represent the bits of words a and x which are being fed back.


5.  Discussion 

Having been designed as a bit level systolic array, the type of circuit described in
section 3 exhibits a number of important features which make it particularly suitable for
very large scale integration. These features have been discussed in detail in previous
papers [11,12] but are worth summarising here:

(1) The convolution function has been factorised down to an array of identical simple cells
and circuits of this type are relatively easy to design, lay out and test.

(2) Each cell is connected to its nearest neighbours only. As a result the interconnection
capacitance  is  reduced  to  a  minimum  and  so  the  circuit  performance  is  enhanced.  Data 
transmission  problems  (for  example  along  long  polysilicon  lines)  are  also  avoided. 

(3)  The  dimensions  of  such  an  array  can  easily  be  expanded  if  necessary. 

(4) The circuit is pipelined at the bit level and hence the maximum clock rate is
determined solely by the propagation delay through a single cell. This ensures the maximum
possible (bit serial) throughput rate.

(5) The only control circuitry required is a simple two phase clock to control the
movement of bits in the circuit and a secondary clock or counter to control the output word
cycle.

At  first  sight  the  type  of  circuit  which  is  described  in  figures  2  and  3  may  appear  to  be 
rather  inefficient.     For  example,   the  partial  products  are  only  generated  on  cells  within  an 
interaction  region  and  these  in  turn  are  only  used  to  form  a  product  on  every  other  clock 
cycle.     However,   it  is  important  to  remember  that  the  cells  are  also  used  to  accumulate  the 
partial  products  and  add  in  any  carries  which  are  generated  in  the  process.     As  a  result, 
the  adder  function  in  each  cell  is  used  much  more  frequently.     The  method  which  we  have 
proposed  for  operating  with  two's  complement  data  and  coefficient  values  would  also  appear 
to  be  rather  inefficient  since  the  effect  of  extending  input  words  from  n  bits  to  2n  bits  is 
to  halve  the  throughput  rate  of  the  circuit.     There  are  a  number  of  alternative  techniques 
for  handling  two's  complement  numbers  without  extending  the  wordlengths  in  this  way  but  they 
all  involve  the  use  of  a  more  complicated  cell  and  our  initial  investigations  suggest  that 
the  overall  performance  of  the  circuit  may  well  be  reduced  as  a  result.     On  balance,   it  may 
be  better  to  carry  out  the  convolution  operation  using  a  larger  number  of  simple  cells  each 
of  which  operates  faster,   occupies  less  area  and  consumes  less  power. 

The  type  of  systolic  array  which  has  previously  been  proposed  for  carrying  out  the  con- 
volution operation  - 1 5 , 14  usually  involves  the  use  of  much  larger  processing  cells,  each 
comprising  a  number  of  word  level  components  such  as  multipliers,   adders  and  delay  elements. 
These  in  turn  comprise  a  number  of  bit  level  components  but  these  do  not  necessarily 
operate in a systolic fashion. Complete products of the form $a_i x_{j-i}$ are formed within a
single  multiplier  and  then  added  to  the  accumulating  result  which  must  be  delayed  by  the 
appropriate  number  of  clock  cycles.     Our  approach  is  quite  distinct  since  the  entire  con- 
volution process  is  formulated  at  bit  level  right  from  the  outset  and  involves  a  highly 
regular  array  of  distributed  processing  and  memory  elements.     No  subset  of  these  elements 
can  be  associated  with  a  specific  multiplication  or  addition  operation  at  the  word  level. 
These  operations  take  place  instead  within  a  moving  interaction  region. 

The  bit-level  systolic  array  which  we  have  described  in  this  paper  can  be  used  to  perform 
a  number  of  other  signal  processing  functions.  For  example,  the  correlation  operation  which 
may  be  expressed  in  the  form 

$$y_j = \sum_{i=0}^{N-1} a_i\, x_{j+i} = \sum_{i=0}^{N-1} a_{N-1-i}\, x_{N-1+j-i} \qquad (5.1)$$

can obviously be evaluated by means of the data and coefficient timing structure given in
figure 2 or figure 3 provided that the coefficients $a_i$ and $a_{N-1-i}$ ($i = 0, 1, \ldots, N-1$)
are interchanged.

The array can also be used to implement two types of algorithm for generating matrix x
vector products. One of the algorithms is particularly suitable for use with banded
matrices while the other is more appropriate for full matrices. It is not appropriate to
discuss the procedures in detail here but the underlying data structure can be deduced from
the word level schematic timing diagrams given in figures 5(a) and 5(b). The algorithm
outlined in figure 5(a) is similar in many respects to the one proposed by Kung² for use
with banded matrices but it is worth pointing out that the role of the intermediate zero
words in Kung's algorithm is taken up by the intermediate zero bits in our structure. It
follows from figure 5(a) that each data word $x_i$ simply moves up through the array (in a bit
serial manner) whilst the interaction region associated with each result moves down. In
the particular situation where the elements on each diagonal of the matrix are equal and
every diagonal above (below) the leading diagonal is zero, the algorithm in figure 5(a)
corresponds exactly to the convolution (correlation) algorithm described in detail in
figures 2 and 3 where each coefficient simply cycles round on a given row of cells. The
algorithm described in figure 5(b) is similar in many respects to the one proposed by Hwang
and Cheng¹⁴ and is more efficient in situations where the matrices are full. The important
point to note is that in this case each word of the data vector cycles round on a given row
of cells while a sequence of different matrix elements is input to the same row. It would
be necessary, of course, to have two sets of parallel input ports for this application.


Figure  5.     Schematic  diagram  illustrating  how  the  convolution  circuit 
could  be  used  to  evaluate  matrix  x  vector  products: 
(a)     for  banded  matrices       (b)     for  a  general  matrix 
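
The relationships described above between convolution, correlation and banded matrix x vector products can be checked at word level with a short sketch. This is only an illustration of the arithmetic, not of the bit-level array itself; the function names and test values are assumptions introduced here, and the convolution is taken in the form $y_j = \sum_i a_i x_{j-i}$.

```python
import numpy as np

def convolve(a, x):
    """y_j = sum_i a_i * x_{j-i}, keeping only fully overlapped terms (j = N-1 ... len(x)-1)."""
    N = len(a)
    return np.array([sum(a[i] * x[j - i] for i in range(N))
                     for j in range(N - 1, len(x))])

def correlate(a, x):
    """y_j = sum_i a_i * x_{j+i}, for j = 0 ... len(x)-N."""
    N = len(a)
    return np.array([sum(a[i] * x[j + i] for i in range(N))
                     for j in range(len(x) - N + 1)])

# Illustrative coefficient and data words (hypothetical values).
a = np.array([1, 2, 3])
x = np.array([4, 0, -1, 5, 2, 7])

# Correlation is the same computation as convolution with the coefficient order reversed.
corr = correlate(a, x)
conv_rev = convolve(a[::-1], x)

# A banded matrix with equal elements on each diagonal and zeros above the
# leading diagonal reproduces the convolution as a matrix x vector product.
N, L = len(a), len(x)
T = np.zeros((L, L), dtype=int)
for r in range(L):
    for i in range(N):
        if r - i >= 0:
            T[r, r - i] = a[i]
band = (T @ x)[N - 1:]
```

Here `corr` and `conv_rev` agree element by element, as do `band` and `convolve(a, x)`.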


6.  Conclusions 

In  this  paper  we  have  described  a  bit-level  systolic  array  which  could  be  used  to  perform 
a  range  of  signal  processing  functions.     Our  initial  investigations  suggest  that  an  array  of 
20  x  32  cells  could  be  integrated  on  a  single  chip  using  CMOS  technology  with  a  minimum 
feature size of 3 µm¹⁵. This circuit could be clocked at a rate somewhere in excess of
20  MHz  providing  16  bit  results,   say,   from  a  32  point  convolution  process  at  a  rate  of  0.5 
to  1  MHz. 

Abstract

A tutorial review of incoherent electro-optical processing (EOP), a relatively new
approach to compact, high-speed signal processing, is presented. The tutorial begins with
a brief explanation of the differences between coherent and incoherent optical processors.
Different EOP architectures are covered next, with emphasis placed on temporal
scanning processors, which have the capability of performing real-time, linear transformations
on time varying inputs. Systems of this type are directly applicable to many signal
processing problems in radar or sonar. The presentation concludes with a discussion
of the state-of-the-art temporal scanning EOP, which includes a light emitting diode
and an imaging charge coupled device as its main components.

Introduction 

Recent advances in solid state technology have made incoherent electro-optical processors
(EOP) viable candidates for real-time signal processing systems.¹,² Solid state
technology has reduced the size and power requirements of EOPs while increasing their
speed and durability. This has generated interest in incoherent EOPs and, quite often,
many questions from individuals whose backgrounds are not in optics. What is the
difference between coherent and incoherent optical processors? Are there different types
of incoherent processors? Do incoherent systems require lasers, lenses or mirrors?
Where is the state-of-the-art today? These are some of the questions that will be
answered in this brief tutorial.

Differences  Between  Coherent  and  Incoherent  Optical  Processors 

The main attraction of a coherent optical system is the two dimensional Fourier transform
relation that exists between the front and back focal planes of a lens placed in the
system. This relationship allows operations to be performed in the spatial frequency
domain, e.g., spatial filtering. Incoherent optical systems do not possess this property;
therefore, operations are performed in the spatial domain. An analogy can be drawn
with electronic systems where filtering, for example, can be performed in the frequency
domain by a simple multiplication:

$$y(t) = \mathcal{F}^{-1}\{Y(\omega)\} = \mathcal{F}^{-1}\{X(\omega)H(\omega)\} \qquad (1)$$

or in the time domain by a convolution

$$y(t) = \int_{-\infty}^{\infty} x(\tau)\,h(t-\tau)\,d\tau. \qquad (2)$$

Equation    (1)    can  be  implemented  very  easily  with  a  coherent  processor   in  the  form  of  a 
Vander  Lugt  filter^  while  the  equivalent  of  time  domain  convolution    (i.e.,   equation  (2)) 
in  the  spatial  domain  would  be  necessary  in  an  incoherent  processor. 
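
The equivalence of equations (1) and (2) can be illustrated numerically for sampled signals. The sketch below is an electronic analogy only, using NumPy; the signal and filter values are arbitrary assumptions, and zero padding is used so that the circular convolution implied by the discrete Fourier transform matches the linear convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # input signal samples (illustrative)
h = rng.standard_normal(16)   # filter impulse response (illustrative)

# Frequency domain: multiply the transforms, as in equation (1).
n = len(x) + len(h) - 1       # zero-pad so circular and linear convolution agree
y_freq = np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(h, n)).real

# Time domain: direct convolution, as in equation (2).
y_time = np.convolve(x, h)

max_diff = np.max(np.abs(y_freq - y_time))
```

The two results differ only by floating-point rounding, so `max_diff` is negligibly small.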

Traditionally,  coherent  systems  are  more  sensitive  to  environmental  factors  such  as 
dust  and  vibration  than  incoherent  systems  and  usually  do  not  operate  in  real  time  as 
some  incoherent  systems  do.     However,   state-of-the-art  coherent  systems  employing  inte- 
grated optics  do  operate  in  real-time  at  frequencies  and  resolutions  compatible  with  many 
radar  applications. 

Coherent systems possess the property defined by (1) since they include a monochromatic
or single frequency light source which produces a coherent optical field. This field
is a function of both position and time and, since all detectors integrate over a period
of time that is long compared to the fluctuations of the field, the intensity of the detected
field is a time averaged quantity,

$$I(x) = \langle V(x,t)\,V^*(x,t)\rangle \qquad (3)$$

where the brackets denote the time average and the star denotes the complex conjugate.
Equation (3) is valid whether the field is coherent, partially coherent or incoherent.
If there is complete correlation between every point in the field, i.e.,

$$I(x_n, x_m) = \langle V(x_n,t)\,V^*(x_m,t)\rangle = \text{MAXIMUM} \qquad (4)$$

then the field is coherent and the space and time dependence can be separated to yield
(the time dependence cancels)

$$I(x_n, x_m) = \psi(x_n)\,\psi^*(x_m) \qquad (5)$$

where $\psi$ is the complex field amplitude. Normalizing (5) by the square root of the product
of the individual intensities at the two field points and taking the absolute value
results in the correlation coefficient which is a convenient measure of coherence. For
the coherent case, the correlation coefficient is unity,

$$\frac{|\psi(x_n)\,\psi^*(x_m)|}{[I(x_n)\,I(x_m)]^{1/2}} = 1 \qquad (6)$$


that is, the field points have a fixed phase relation. For the incoherent case, the
light at any one point in the field is completely uncorrelated with every other point in the
field over the time average; therefore,

$$\frac{|\langle V(x_n,t)\,V^*(x_m,t)\rangle|}{[I(x_n)\,I(x_m)]^{1/2}} = 0, \qquad n \neq m. \qquad (7)$$


Note that the instantaneous value of $V(x_n,t)\,V^*(x_m,t)$ is not necessarily zero but
since the fluctuation at $x_n$ versus time is unrelated to the fluctuation at $x_m$ versus
time, the time average of this quantity ($\langle\cdot\rangle$) is zero. Partially coherent fields have
correlation coefficients falling between the two extremes defined above. Theoretically,
a field is never completely incoherent. In practice, though, an incoherent field can be
defined if the region over which the field is coherent is smaller than the field resolution
of the system imaging it. Typically, the spatial coherence of a field is represented
by a coherence interval which is the separation of two points in that field for a
given value of the correlation coefficient while the temporal coherence is represented by
a coherence length defined by


$$l_c = \frac{c}{\Delta\nu} = \frac{\lambda^2}{\Delta\lambda} \qquad (8)$$

where $\Delta\lambda$ is the spectral width and $\lambda$ is the center wavelength.
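
The two limiting cases of the correlation coefficient can be illustrated with a small simulation. This is a sketch under stated assumptions: the fields are modelled as unit-magnitude phasors with random phase, the number of time samples is arbitrary, and the LED wavelength and spectral width used for the coherence length are hypothetical values combined through the usual relation $\lambda^2/\Delta\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20000  # number of time samples in the average (arbitrary choice)

def correlation_coefficient(v_n, v_m):
    """|<V_n V_m*>| / sqrt(I_n I_m), with <.> taken as the mean over time samples."""
    cross = np.mean(v_n * np.conj(v_m))
    i_n = np.mean(np.abs(v_n) ** 2)
    i_m = np.mean(np.abs(v_m) ** 2)
    return abs(cross) / np.sqrt(i_n * i_m)

phase = rng.uniform(0, 2 * np.pi, T)
v_n = np.exp(1j * phase)

# Coherent case: the second point follows the first with a fixed phase offset.
v_m_coherent = 0.5 * np.exp(1j * (phase + 0.3))
# Incoherent case: the second point fluctuates independently of the first.
v_m_incoherent = np.exp(1j * rng.uniform(0, 2 * np.pi, T))

mu_coherent = correlation_coefficient(v_n, v_m_coherent)
mu_incoherent = correlation_coefficient(v_n, v_m_incoherent)

# Coherence length for a hypothetical LED: 850 nm center wavelength, 40 nm spectral width.
wavelength = 850e-9
spectral_width = 40e-9
coherence_length = wavelength ** 2 / spectral_width   # in metres
```

With these assumptions `mu_coherent` is unity to rounding error, `mu_incoherent` falls toward zero as the averaging time grows, and the coherence length comes out at roughly 18 µm.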


The  operations  achievable  by  incoherent  systems  can  be  described  by  the  one  and  two 
dimensional  superposition  integrals  and  their  discrete  counterparts: 


$$g(u) = \int_{r=r_1}^{r_2} f(r)\,h(r,u)\,dr \qquad (9)$$

$$g(m) = \sum_{n=n_1}^{n_2} h(m,n)\,f(n) \qquad m = 1, 2, \ldots M \qquad (10)$$

$$g(u,v) = \int_{r=r_1}^{r_2} \int_{s=s_1}^{s_2} f(r,s)\,h(r,s;u,v)\,dr\,ds \qquad (11)$$

$$g(m,p) = \sum_{n} h(m,n)\,f(n,p) \qquad m = 1, 2, \ldots M \qquad p = 1, 2, \ldots \qquad (12)$$

The operations implied by these equations are numerous and include all linear transformations.
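
In discrete form the superposition sums are matrix x vector and matrix x matrix products, which is why they cover all linear transformations. A minimal sketch, assuming arbitrary dimensions and random test data:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, P = 4, 6, 3                  # illustrative dimensions
h = rng.standard_normal((M, N))    # kernel h(m, n)
f1 = rng.standard_normal(N)        # one dimensional input f(n)
f2 = rng.standard_normal((N, P))   # two dimensional input f(n, p)

# Equation (10) written out term by term ...
g1_loop = np.array([sum(h[m, n] * f1[n] for n in range(N)) for m in range(M)])
# ... is the matrix x vector product.
g1 = h @ f1

# Equation (12) is the corresponding matrix x matrix product.
g2 = h @ f2
```

Choosing `h[m, n]` to depend only on `m - n` reduces the same product to a convolution.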

There  are  three  classes  of  incoherent  optical  processors: 

•  spatial  scanning  processors, 

•  non-scanning  processors,  and 

•  temporal  scanning  processors. 

A  brief  description  will  be  given  of  spatial  scanning  and  non-scanning  processors.  More 
detail  will  be  provided  for  temporal  scanning  processors  due  to  their  applicability  to 
real-time  signal  processing. 

Spatial  Scanning  Processors 

By  definition,   spatial  scanning  processors  are  systems  which  compute  the  correlation 
function  by  lateral  translation  or  scanning  of  some  element  in  the  system.     Inputs  are 
functions  of  spatial  variables,  e.g.,  x  and  y.     There  are  four  types  of  spatial  scanning 
systems  depending  on  the  number  of  variables  and  channels: 

•  one  dimensional,   f(x)   and  h(x) 

• one dimensional multichannel, f(x) and $h_n(x)$

• two dimensional, f(x,y) and h(x,y)

• two dimensional multichannel, f(x,y) and $h_{m,n}(x,y)$.

Two dimensional systems are natural extensions of one dimensional systems and one dimensional
single channel systems are simplifications of one dimensional multichannel systems.

In one dimensional multichannel systems, f and h vary in only one dimension. However,
h includes N separate channels as shown in Figure 1.¹


[Figure 1 component labels: imaging lens system, integrating lens, detector array]


Figure  1.     The  one-dimensional,  multichannel  spatial  scanning  processor. 


This system will simultaneously correlate f(x) with a library of reference functions,
$h_n(x)$. The astigmatic lens system makes this possible by collecting light transmitted
by each channel of the mask and directing it to the appropriate detector. Now, if the
reference functions (or input) are (is) translated by $x_0$ the N channel outputs are computed
simultaneously as:

$$g_n(x_0) = \int f(x)\,h_n(x - x_0)\,dx \qquad n = 1, 2, \ldots N \qquad (13)$$

$$g_n(x_0) = \int f(x - x_0)\,h_n(x)\,dx \qquad n = 1, 2, \ldots N \qquad (14)$$
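
A discrete sketch of the multichannel correlation may help fix ideas. It follows the form in which the reference functions are translated by $x_0$; the function name, the reference library and the input are hypothetical, and the mask is taken to be opaque (zero) outside its width.

```python
import numpy as np

def multichannel_correlate(f, refs, shifts):
    """g[n, s] = sum_x f[x] * refs[n, x - shifts[s]], with the mask zero outside its width."""
    n_ch, width = refs.shape
    g = np.zeros((n_ch, len(shifts)))
    for s, x0 in enumerate(shifts):
        for x in range(len(f)):
            if 0 <= x - x0 < width:
                g[:, s] += f[x] * refs[:, x - x0]   # all N channels updated together
    return g

# Library of three reference functions (hypothetical transmittance patterns).
refs = np.array([[1, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 1, 1]], dtype=float)

# Input containing the third reference pattern starting at position 3.
f = np.zeros(10)
f[3:7] = refs[2]

shifts = list(range(7))
g = multichannel_correlate(f, refs, shifts)
best_channel, best_shift = np.unravel_index(np.argmax(g), g.shape)
```

The largest output appears in the channel holding the matching reference, at the translation where the input pattern sits.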
Non-Scanning Processors


Non-scanning processors can perform a full two dimensional correlation without the
need for relative motion between any two elements as shown in Figure 2.¹

[Figure 2 plane labels: f(x,y), h(x,y), g(x,y)]

Figure 2. Basic geometry for non-scanning or "shadow-casting" processors.


The  equivalent  of  the  scanning  operation  is  performed  by  taking  advantage  of  the  redun- 
dancy inherent  in  incoherent  optical  systems.     This  type  of  system  is  also  referred  to  as 
a  "shadow-casting"  system. 

A basic requirement for a shadow-casting system is that the input function must consist
of a diffusely scattering source, f(x,y), such as a CRT. Light rays from all points
of this type of source are spread evenly about the source normal. Also, if the medium is
homogeneous as we are assuming, the light rays will travel in straight lines. This means
that rays intersecting at the point $(x_0, y_0)$ in the output plane intersect in the h(x,y)
plane such that they form there a perfect shadow image of f(x,y). From geometric
considerations, the image is reduced by the factor

$$r = b/(a + b) \qquad (15)$$

and displaced from the axis in the h(x,y) plane to

$$P = (x_0 d,\; y_0 d) \qquad (16)$$

where

$$d = a/(a + b). \qquad (17)$$

Neglecting diffraction effects, the resulting output is

$$g(x_0, y_0) = \int_{x=x_1}^{x_2} \int_{y=y_1}^{y_2} f\!\left(\frac{x - x_0 d}{r},\, \frac{y - y_0 d}{r}\right) h(x,y)\,dx\,dy. \qquad (18)$$

This  is  calculated  simultaneously  for  all  points   in  the  output  plane.     It  is   important  to 
point  out  that  shadow  casting  systems  are  ultimately  limited  by  diffraction  effects. 
More  specifically,   diffraction  by  the  mask,   h(x,y),   limits   the  space-bandwidth  product  of 
the  system.     Space-bandwidth  products  achievable  by  these  systems  are  somewhat  less  than 
those  achievable  by  spatial  and  temporal  scanning  processors. 


Temporal  Scanning  Processors 


Temporal scanning processors have the capability of performing real-time, linear
transformations (encoded on the transparency h) on time varying inputs. Systems of this
type are directly applicable to many signal processing problems because of their real-time
processing capability. To date, the most useful temporal scanning processors have
been used to perform iterative multichannel input/multichannel output^»6 and single
channel input/multichannel output2'^ processing functions. The former has been mainly
applied to phased array radar and, although it uses electrical feedback from output to
input and photodiodes versus an imaging charge coupled device (ICCD) in the output plane,
its physical operation is quite similar to the latter. For purposes of this brief
tutorial it will be more informative to discuss the single channel input/multichannel
output system in detail.
Single input/multichannel temporal processors perform operations in the form of equation (9):

$$g(u) = \int_{r=r_1}^{r_2} f(r)\,h(r,u)\,dr. \qquad (9)$$

The more useful application of this system is as a one dimensional, multichannel correlator
as depicted in Figure 3.¹

[Figure 3 component labels: condensing lens, imaging lens, multichannel detector]

Figure 3. Single input, multichannel temporal processor.

Referring to Figure 3, f(t) is a time varying electrical signal which modulates the
output intensity of the light source. Light is collected by a condensing lens and passed
through $h_n(x)$. During the time interval over which f(t) is to be processed, i.e., the
integration time, the mask or the detector is translated in the x direction to give

$$g_n(x) = \int_{t=t_1}^{t_2} f(t)\,h_n(x - vt)\,dt \qquad n = 1, 2, \ldots N \qquad (19)$$

where v is the velocity of the mask image relative to the detector. This velocity is
proportional to the desired integration time.
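
The scanning operation of equation (19) can be imitated in discrete time by accumulating, at each time step, the mask image displaced by v·t and weighted by the source intensity. This is a sketch only: the function name, the sample values and the choice of v = 1 detector element per time step are assumptions made for illustration.

```python
import numpy as np

def temporal_scan(f, masks, v=1):
    """g[n, x] = sum_t f[t] * masks[n, x - v*t], a discrete analogue of equation (19)."""
    n_ch, width = masks.shape
    n_out = v * (len(f) - 1) + width
    g = np.zeros((n_ch, n_out))
    for t, sample in enumerate(f):
        # The detector integrates the mask image, displaced by v*t,
        # scaled by the light source intensity during time step t.
        g[:, v * t: v * t + width] += sample * masks
    return g

f = np.array([1.0, 0.5, 0.0, 2.0])          # source intensity samples (illustrative)
masks = np.array([[1.0, 2.0, 3.0],          # h_1(x)
                  [0.0, 1.0, 0.0]])         # h_2(x)
g = temporal_scan(f, masks)
```

With v = 1 each row of `g` equals the convolution of `f` with the corresponding mask row, for example `np.convolve(f, masks[0])`.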

Two  classes  of  implementations  are  feasible  for   temporal  scanning  processors: 

•  scanning  mask  and 

•  scanning  detector. 

The most compact real-time implementation is a scanning detector system, alluded to
earlier,²,¹ which uses an area array ICCD as the scanning detector. In this system the
image is electronically shifted within the ICCD resulting in no actual spatial movement.
This type of design is the most promising for future real-time signal processing applications
and is the subject of the next section.

ICCD Based EOP


Figure  4  shows  the  main  components  of  the  EOP  in  conceptual  block  diagram  form. 


[Figure 4 block labels: area array CCD to perform detection, delay and summation of signal; system control]

Figure 4. Electro-optical processor conceptual block diagram.


The  function  of  each  component  and  its  relation  to  overall  system  operation  will  be  dis- 
cussed in  a  qualitative  manner.     A  more  detailed  description  and  model  can  be  found  else- 
where. ° '  9 

The time varying input signal, f(t), modulates the output light intensity of the light
emitting diode (LED) to produce a spatially uniform light field. An electrical-to-optical
conversion efficiency, $N_1$, is associated with the conversion process to account
for the inefficiency of the conversion performed by the LED. For purposes of the qualitative
system description, f(t) will be considered as a sampled data signal, f(n), even
though a sample and hold (S/H) function is not indicated prior to the LED. In this case
the CCD, operating in its time-delay-and-integrate (TDI) mode, actually performs the sampling
function. Therefore:

$$f_L(t) = N_1 f(t) \rightarrow N_1 f(n) = f_L(n) \qquad (20)$$

Note that the system can operate on the output of a S/H circuit provided the S/H is in
synchronization with the CCD.

The spatial transmittance function of the mask takes the form

N-l       M-l  /  a      \  / 

T(x,y)    -     £         £       rect    /  \    rect    /  \  (21) 


where k, m, A and B are defined in Figure 5 and

$$a_{km} = C\,h_{km} \qquad (22)$$

$$K = B/A \qquad (23)$$

where C is a constant used to scale the coefficient values to units of area. It is
important to note that the coefficient values, $h_{km}$, represent the values of a matrix to
be multiplied by the input vector, f(n). Also each elemental cell, (k,m), of the mask is
aligned with the corresponding pixel of the CCD array. As shown in Figure 5, the values
of $a_{km}$ are encoded into the mask of the processor by varying the area of the aperture
in each elemental cell of the transparency.

Figure 5. Optical memory mask showing only the i,jth element. The mask consists of a
rectangular array of NxM rectangular elements. The clear area of the i,jth element is
$Ka_{ij}$, where K = B/A.

As the light passes through the transparency, the required multiplication occurs, i.e.,

$$f_T(x,y;n) = f_L(n)\,T(x,y). \qquad (24)$$

The uniform irradiance pattern described by equation (24) is incident upon the CCD array
for each input cycle and is detected and stored by the CCD array as an NxM array of
analog charge packets. To describe the charge ENTERING the i,jth element of the CCD due
to the nth input cycle, $Q_{nij}$, the following expression can be written:

$$Q_{nij} = \int_{x=A(i+1/2)}^{A(i+3/2)} \int_{y=B(j+1/2)}^{B(j+3/2)} \int_{t=nt_0}^{nt_0+d} f_T(x,y;n)\,dx\,dy\,dt \qquad (25)$$

where $t_0$ is an arbitrary starting time. Performing the integration results in:

$$Q_{nij} = N_1 N_2\, d\, C K\, h_{ij}\, f_n \qquad (26)$$

where $N_2$ is the optical-to-charge conversion efficiency and $f_n = f(n)$.

The charge packets of equation (26) can be caused to move vertically by applying a
vertical clock pulse to the CCD. As these packets travel vertically with each successive
clock pulse, more and more charge is added to them. The net result is that each column
of the CCD performs a running sum of products of time varying LED radiance and space
varying transmittance values of that column of the mask:

$$q_{n,m} = C_1 \sum_{k=0}^{N-1} f_{n-k}\,h_{k,m} \qquad n = \ldots -2,-1,0,1,2 \ldots \quad m = 0,1,\ldots M-1 \qquad (27)$$

where

$$C_1 = N_1 N_2\, d\, C K. \qquad (28)$$

These quantities, $q_{n,m}$, are shifted into the CCD parallel-to-serial register and then
shifted horizontally out of the system as the desired output vector

$$y_{n,m} = C_2 \sum_{k=0}^{N-1} f_{n-k}\,h_{k,m} \qquad n = \ldots -2,-1,0,1,2 \ldots \quad m = 0,1,\ldots M-1 \qquad (29)$$

where

$$C_2 = N_3 C_1 \qquad (30)$$

and $N_3$ is the charge-to-voltage conversion efficiency of the CCD output circuit. It is
important to note that the horizontal clock rate, i.e., the clock rate of the CCD parallel-to-serial
register, must be M times the vertical clock rate to ensure that the register
will be empty when the subsequent row of charge packets is ready to be read out.
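
The time-delay-and-integrate operation can be sketched as a shift-and-accumulate loop. The following is a simplified model rather than a description of any particular device: all conversion constants are set to one ($C_1 = 1$), the function name and test data are hypothetical, and the packets are assumed to shift from row N-1 toward row 0, with row 0 feeding the readout register, so that the indexing matches equation (27).

```python
import numpy as np

def tdi_ccd(f, h):
    """Simulate N x M charge packets that integrate f[n]*h[k, m] and shift one row per cycle."""
    N, M = h.shape
    charge = np.zeros((N, M))
    out = []
    for sample in f:
        charge += sample * h                     # exposure: every pixel adds f[n] * h[k, m]
        out.append(charge[0].copy())             # row 0 moves into the readout register
        charge = np.roll(charge, -1, axis=0)     # remaining packets shift one row toward row 0
        charge[-1] = 0.0                         # an empty packet enters at the far row
    return np.array(out)                         # out[n, m] corresponds to q_{n,m}

rng = np.random.default_rng(4)
h = rng.uniform(0.0, 1.0, (5, 3))    # mask transmittance values h[k, m] (illustrative)
f = rng.uniform(0.0, 1.0, 12)        # LED intensity samples f[n] (illustrative)
q = tdi_ccd(f, h)

# Direct evaluation of the running sum of products for comparison (f taken as zero before n = 0).
expected = np.stack([np.convolve(f, h[:, m])[:len(f)] for m in range(h.shape[1])], axis=1)
```

Each output row of `q` is one output vector; successive rows differ by one new input sample, which is the sliding window behaviour described in the text.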

The  processing  scheme  outlined  above  will  continue  until  the  input  is  exhausted.  Due 
to  the  inherent  design  of  the  imaging  CCD,   this  processor  operates  as  a  sliding  window  on 
the  input  sequence.     Each  sequential  output  vector  is  a  function  of  one  input  sample  that 
is  not  included  in  the  preceding  output  vector.     A  recommendation  made  by  Bromley^  and 
Monahan^   is  that  CCD  manufacturers  investigate  adding  a  controllable  charge  dump  fea- 
ture into  the  area  array  imaging  CCDs.     This  would  permit  controllable  dumping  of  charge 
packets  between  the  vertically  shifting  array  and  horizontally  shifting  buffer  register 
to permit any desired amount of input data processing overlap. For example, a 50% overlap
is typical in many applications; therefore, by dumping appropriate rows of charge
packets, the 50% overlap could be implemented resulting in increased system throughput
rate.

A  single  channel  version  of  the  EOP  direct  form  models/9  described  by  equation  (29) 
is  shown  in  Figure  6. 


Figure  6.     EOP  direct  form  model 
Any linear transformation that takes the form of equation (29)/Figure 6 can be implemented
by an EOP.

Figures 7a and 7b show several EOP configurations that have been investigated by
Monahan and Bromley.⁷


Figure  7a.     Exploded  view  of  an  EOP  configuration  examined  by  Monahan  and  Bromley 


Figure  7b.     An  EOP  implementation  yielding  more  uniform  illumination  of  the  mask/ICCD 


Both  are  compact  implementations  that  have  the  mask  etched  on  the  face  of  the  ICCD.  The 
implementation  shown  in  Figure  7b  results  in  a  more  uniform  illumination  of  the  mask/ICCD. 

Abstract

Acousto-optical computation of continuous and discrete Fourier transforms has been
performed using a time-integrating architecture. Time-integration in conjunction with diode
lasers and bulk optics can be used to produce inherently compact optical systems, and
several compact processors have been demonstrated. Performance parameters and tradeoffs
have been analyzed for these processors, and present device limitations identified.
Additional new concepts for miniaturization, including application of integrated optics and
multichannel operation, are discussed.

Introduction 

There has been much recent work in the application of acousto-optics (A-O) to signal
processing.(1) One of the main attractions of acousto-optics is that it affords a rapid,
effective means of data input, with gigahertz bandwidth and good dynamic range acoustic
devices now available. Combining the large bandwidth capability with the Fourier
transforming properties of a lens allows for rapid processing of one-dimensional (1-D)
signals. Processing speed, however, is not the only important consideration, so use of a
lens to transform the information on an acoustic wave is not necessarily the best method of
performing Fourier transforms in all instances. An alternative method for Fourier
transformation is to employ chirp algorithms; this method possesses several advantages. For
example, there is flexibility to optimize a desired parameter such as speed or resolution,
data input can be in a variety of formats, and it is easier to obtain phase information
compared to the lens-transform method. One particularly useful chirp algorithm is the
following. The 1-D Fourier integral is defined as

$$F(\alpha\tau) = \int_{-\infty}^{\infty} S(t)\,\exp(-i\alpha\tau t)\,dt, \qquad (1)$$

where $\alpha\tau$ can be considered the frequency variable. Using the identity

$$-\tau t = \tfrac{1}{4}\left[(t-\tau)^2 - (t+\tau)^2\right],$$

Equation (1) is changed into

$$F(\alpha\tau) = \int_{-\infty}^{\infty} S(t)\,\exp[+i(\alpha/4)(t-\tau)^2]\,\exp[-i(\alpha/4)(t+\tau)^2]\,dt. \qquad (2)$$


This algorithm is applicable for both continuous and discrete Fourier transforms and has
been referred to as the triple-product-convolver algorithm,(2) since there are now three
input terms and the exponential terms in Eq. (2) are chirps of equal but opposite slopes.
It is possible to implement Eq. (2) either with A-O(3) or with purely electronic filtering
approaches.(4) However, use of the optical techniques to be discussed here offers
additional advantages and solutions to problems encountered with previous A-O approaches.
For example, by using various time-integrating (TI) optical architectures, diode laser light
sources, and the chirp algorithm represented by Eq. (2), one can obtain simplified and
miniaturized processors, easy computation of either continuous or discrete Fourier
transforms and increased data throughput. Although this paper is primarily concerned
with TI processing, it also includes some discussion of space-integrating (SI)
Fourier-transform processors, comparing them with TI processors and discussing their optical
implementation.

Basic Processor Operation


Continuous  Fourier  Transforms 

To  Implement  Eq.  (2),  we  have  chosen  to  perform  the  multiplication  by  interference  of 
information-carrying  light  beams  (produced  by  acousto-optic  diffraction)  and  to  perform  the 
integration  by  time-integration  of  the  light  signal  on  a  detector  array.  The  core  of  the 
optical  signal  processor  is  the  modified  Mach-Zehnder  interferometer  as  shown  in  Fig.  1.  An 
additive,  time-integrating  scheme  was  adopted  rather  than  multiplicative  ones  such  as 
described  elsewhere . ^5 )  The  optical  output  of  a  diode  laser,  after  collimation,  Is  split 
into  two  beams  of  equal  intensity  by  the  first  beamsplitter.  Each  beam  is  then  directed 
through  a  Bragg  cell.  The  laser  diode  beam  was  collimated  along  the  direction  transverse  to 
laser  junction  plane  to  maximize  beamwidth,  and  compressed  along  the  laser  Junction 
direction.  The  Bragg  cells  are  oriented  so  that  two  positive  first-order  diffracted  beams 
are  produced.  These  diffracted  beams  are  combined  at  a  second  beam  splitter  and  detected  by 
a  square-law  photodetector  array.  In  this  implementation  of  the  Fourier  transform,  the 
Bragg  cells  are  driven  in  the  opposite  directions  by  the  same  linear-FM  (chirped)  RF 
waveform.  The  Bragg-dif f racted  optical  waves  will  be  phase-modulated  to  yield  the  two 
exponential  terms  of  opposite  delays.  The  diode  laser  is  intensity-modulated  by  the  signal 
S(t).  When  the  two  diffracted  fields  are  mixed  in  the  square-law  detector  array  and 
time-integrated  over  a  period  T,  the  electrical  charge  from  the  detector  array  consists  of 
three  terms:  two  bias  terms  proportional  to  Intensities  of  the  optical  beams  in  the  two 
legs  of  the  interferometer,  and  a  cross  term  containing  the  desired  Fourier  transform  of  the 
signal  modulating  the  diode  laser.  Assuming  that  the  path  lengths  and  optical  intensities 
in  the  two  legs  of  the  interferometer  have  been  made  equal,  then  the  charge  representing  the 
cross  term  is : 

qT  a  4Re{exp(-2iu0T )   /     S(t)   exp (-2iax t )dt }    ,  (3) 

if the linear-FM RF wave is represented by $\cos[\omega_0 t + (\alpha/2)t^2]$, where $\omega_0$ is the carrier
angular frequency, $\alpha$ is the chirp acceleration, and $\tau$ is the time delay incurred in the
Bragg cells ($\tau = z/v$, where z is the spatial coordinate along the acoustic wave and v is the
acoustic velocity in the Bragg cell). By recognizing $\alpha\tau$ as an instantaneous frequency, the
integral in Eq. (3) becomes the Fourier transform of the signal S(t). Hence, the spectral
information is impressed onto a spatial frequency $\omega_0/v$. The frequency resolution is
approximately 1/T, the reciprocal of the detector integration time. The number of
resolvable elements is $\alpha T\tau_{max}$, where $\tau_{max}$ is the maximum delay of the Bragg cell. The
quantity $\alpha T$ is the chirp bandwidth. When an angle between the wavefronts of the two
combined beams is introduced by changing the orientation of one of the turning mirrors, an
additional phase factor $k\theta z$ should be included in the spatial carrier, where $k$ is the wave
number of the optical wave and $\theta$ is the angle between the two wavefronts. This degree of
freedom allows us to select the optimum value of the spatial frequency such that it is well
resolved by the detector array.
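
The mapping from temporal frequency to Bragg-cell delay in Eq. (3) can be checked numerically. The following is a minimal sketch, not the authors' procedure: the function name `ti_cross_term`, the chirp rate, the scaled-down carrier and the sampling grids are all assumptions chosen for illustration. A tone at frequency f should produce an envelope peak where $2\alpha\tau = 2\pi f$.

```python
import numpy as np

def ti_cross_term(signal, t, tau, omega0, alpha):
    """Evaluate the cross term of Eq. (3) on a grid of Bragg-cell delays tau.

    Returns the detector charge (arbitrary units) and the envelope of the
    spatial carrier, i.e. the modulus of the time integral.
    """
    dt = t[1] - t[0]
    # Riemann-sum approximation of the integral of S(t) exp(-2i alpha tau t) dt
    kernel = np.exp(-2j * alpha * np.outer(tau, t))
    integral = (kernel @ signal) * dt
    charge = 4.0 * np.real(np.exp(-2j * omega0 * tau) * integral)
    return charge, np.abs(integral)

# Illustrative (assumed) parameters, not the authors' settings
T = 40e-3                           # integration time, s
t = np.arange(2000) * (T / 2000)    # time samples over (0, T)
tau_max = 2e-6                      # Bragg-cell window, s
tau = np.linspace(0.0, tau_max, 401)
alpha = np.pi * 1000.0 / tau_max    # chirp rate chosen so 1 kHz maps to tau_max
omega0 = 2 * np.pi * 5e6            # scaled-down carrier so the fringes are sampled

f_tone = 400.0                      # Hz
signal = 1.0 + np.cos(2 * np.pi * f_tone * t)  # non-negative intensity drive

charge, envelope = ti_cross_term(signal, t, tau, omega0, alpha)

# Ignore the DC peak near tau = 0 and locate the tone
search = tau > 0.2e-6
tau_peak = tau[search][np.argmax(envelope[search])]
f_est = alpha * tau_peak / np.pi    # from 2 * alpha * tau = 2 * pi * f
print(f"tone located near {f_est:.1f} Hz")
```

The DC bias of the drive produces its own peak at zero delay, which is why the search above starts away from $\tau = 0$.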

Discrete  Fourier  Transforms 

The  discrete  Fourier  transform  (DFT)  is  commonly  used  to  transform  sampled  temporal 
data  into  the  frequency  domain,  and  for  applications  involving  discrete  time  or  space 
variables,  such  as  pulsed  radar  or  array  beamforming.  The  ability  to  optically  implement 
the DFT involves proper formatting of the input data. For an A-O implementation, it is
advantageous  to  implement  the  DFT  as  the  discrete  form  of  Eq.  (2). 


$$F(k) = \sum_{n=0}^{N-1} S_n(t)\,\exp\!\left(\frac{i\pi(n-k)^2}{2N}\right)\exp\!\left(\frac{-i\pi(n+k)^2}{2N}\right) \qquad (4)$$

Here the required multiplications and summation are performed as described above for the
continuous Fourier transform. A space-integrating (SI) scheme has been used elsewhere to
obtain F(k).(6) In the SI scheme an N-point transform is obtained by using N light beams
equally spaced along acoustic delay lines; each of the individual data points $S_n$ is carried
by a separate light beam as an intensity modulation, the multiplications are done via successive
acousto-optic diffractions, and the light beams are summed by focussing onto a single
photodetector. In the time-integrating (TI) scheme used here, the data is inserted as a time
sequence and the summation is achieved by time integration at the photodetector array. The SI
scheme is appropriate if N parallel channels of data exist and it is not possible or desirable
to time multiplex. The TI scheme is appropriate if the data is in a single temporal stream,
either naturally or through multiplexing. A TI DFT processor has been investigated here because
of its inherent simplicity of implementation and its potential for miniaturization. The data
can be impressed onto the light beam via direct modulation of a diode laser, which offers
compact size and high efficiency.
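
The chirp factorization behind Eq. (4) rests on the identity $(n-k)^2 - (n+k)^2 = -4nk$, so that the product of the two chirp exponentials equals the usual DFT kernel $\exp(-2\pi i nk/N)$ when the denominators are 2N. The sketch below checks this numerically against a library FFT; the function name `chirp_dft` and the 2N normalization are assumptions made for illustration.

```python
import numpy as np

def chirp_dft(samples):
    """DFT computed as a sum over products of two chirp factors.

    Uses exp(i*pi*(n-k)^2/(2N)) * exp(-i*pi*(n+k)^2/(2N)), which equals
    exp(-2*pi*i*n*k/N) because (n-k)^2 - (n+k)^2 = -4*n*k.
    """
    s = np.asarray(samples, dtype=complex)
    N = len(s)
    n = np.arange(N)[None, :]   # input index along columns
    k = np.arange(N)[:, None]   # output index along rows
    up_chirp = np.exp(1j * np.pi * (n - k) ** 2 / (2 * N))
    down_chirp = np.exp(-1j * np.pi * (n + k) ** 2 / (2 * N))
    return (up_chirp * down_chirp) @ s

# Equal-amplitude pulse train of 8 samples, similar in spirit to the test inputs
pulses = np.ones(8)
result = chirp_dft(pulses)
reference = np.fft.fft(pulses)
print(np.allclose(result, reference))
```

In the optical processor the two chirp factors are supplied by the counter-propagating acoustic waves, and the sum over n is carried out by time integration on the detector.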

One  can  derive  from  Eq.  (3)  the  manner  in  which  the  DFT  is  obtained  with  the  TI  processor 
and  some  of  the  characteristics  of  the  transform  output.  If  one  considers  a  sampled  data 
sequence  of  arbitrary  values,  one  may  express  S(t)  as 

$$S(t) = S(nT_s) = \sum_{n=0}^{N-1} S_n(t)\,\delta(t - nT_s),$$

where $T_s$ is the sampling period and $S_n(t)$ is the individual value of the nth data sample.
Thus the integral in Eq. (3) becomes:

$$\int_0^T S(t)\,\exp(-2i\alpha\tau t)\,dt = \sum_{n=0}^{N-1} S_n(nT_s)\,\exp(-2i\alpha\tau nT_s) \qquad (5)$$


provided that $S_n(t) = 0$ for t outside of the integration interval (0,T). Eq. (5) is,
therefore, of the form of a DFT when the signal is a discrete data sequence. It can be
expressed exactly in the form of a DFT if $-2i\alpha\tau nT_s$ is made equal to $-2\pi i nk/N$, i.e.,
$k = (\alpha z N T_s)/(\pi v)$. The largest interval of k that can be observed is


$$\Delta k = \frac{2\alpha T z_{max}}{\pi v} \qquad (6)$$


where $z_{max}$ is the length of the illumination of the acoustic wave, and $NT_s$ has been assumed
to be equal to T. Thus, while there is obviously no limit on N (the input transform
size), for large N the output display does not represent the entire transform space unless
the remaining parameters on the right side of Eq. (5) are changed. However, it is very
often advantageous to have a restricted output region, such as when a certain spectral
region must be examined with high resolution. Output times or data rates are thereby
reduced, since uninteresting regions of the spectrum are not covered.
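
The correspondence between detector position and DFT index can be illustrated with a short numerical sketch. The helper names (`ti_sum`, `delay_for_bin`, `observable_interval`), the chirp rate and the sample values below are assumptions for illustration; the 2-usec window and the bulk acoustic velocity of about 6.5x10^5 cm/sec are taken from the text.

```python
import numpy as np

def ti_sum(samples, alpha, Ts, tau):
    """Right side of Eq. (5), evaluated at each delay in tau."""
    n = np.arange(len(samples))
    return np.exp(-2j * alpha * np.outer(tau, n * Ts)) @ samples

def delay_for_bin(k, alpha, N, Ts):
    """Delay tau = z/v at which Eq. (5) coincides with DFT bin k."""
    return np.pi * k / (alpha * N * Ts)

def observable_interval(alpha, T, z_max, v):
    """Largest observable interval of k, Eq. (6)."""
    return 2.0 * alpha * T * z_max / (np.pi * v)

# Assumed illustration values
N, Ts = 8, 5e-3                 # 8 samples, 5 ms apart (T = N * Ts = 40 ms)
T = N * Ts
alpha = 3e8                     # chirp rate, rad/s^2 (hypothetical)
samples = np.array([1.0, 2.0, 0.0, 1.0, 3.0, 1.0, 0.0, 2.0])

k = np.arange(N)
tau_k = delay_for_bin(k, alpha, N, Ts)
F = ti_sum(samples, alpha, Ts, tau_k)
print(np.allclose(F, np.fft.fft(samples)))

v = 6.5e5                       # acoustic velocity, cm/s
z_max = v * 2e-6                # illuminated length for a 2-usec window, cm
delta_k = observable_interval(alpha, T, z_max, v)
print(f"observable interval of k: {delta_k:.2f}")
```

Raising the chirp rate widens the observable interval of k, while a lower chirp rate spreads fewer bins across the same detector array, which is the trade-off the paragraph above describes.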

From Eqs. (3) and (5) one notes that the detector output is

$$q_T \propto 4\,\mathrm{Re}\left[\exp(-2i\omega_0\tau) \cdot F(k)\right]; \qquad (7)$$

indicating that the envelope of the spatial carrier is the modulus of the DFT.

Experimental  Results 

DFT 

Experiments to evaluate the performance of the TI DFT processor were carried out
using equal-amplitude, equally-spaced pulses with the arrangement of Fig. 1. A 4-mW diode
laser (single-mode under CW conditions) and bulk LiNbO3 Bragg cells (1-GHz center frequency,
1% deflection efficiency/rf watt and 2-usec window) were used. Data was recorded from a
1024-element time-integrating Reticon array using a digitizing oscilloscope having 12-bit
resolution. By reversing the phase of one of the chirps and subtracting the result from
the previous frame, fixed pattern noise and signal-dependent bias terms were eliminated.
Each frame represented a 40 msec integration time. Transforms of data sequences consisting
of different numbers of pulses were performed and the results are shown in Figs. 2 and 3.
The speckled appearance in the figures is due to the digitization of the spatial fringe
carrier. DFTs with N=4 and N=8 are shown in Figs. 2a and 2b, and those with N=64 and N=128
in Fig. 3. That Fourier spectra have been obtained is further supported by
the results of simulation done on a waveform analyser (an FFT device). For
comparison, the corresponding FFT spectra are also given in Fig. 2. The qualitative
agreement is excellent, both following the expected (sin Nx)/(N sin x) dependence. For
transforms with large N, the window of the frequency output space given by Eq. (6) was not
large enough to show any two adjacent principal lobes. Thus, in the case of the 128-point
transform of Fig. 3, comparison was made of the first-sidelobe-to-peak ratio. This ratio is
0.23 as compared to a theoretical value of 0.212, again in good agreement. The symmetry of
the sidelobes, however, does depend strongly on the chirp linearity and environmental phase
perturbations such as vibrations and turbulence.
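
The theoretical sidelobe level can be reproduced from the (sin Nx)/(N sin x) dependence. The sketch below is illustrative only (the function names are assumptions). It suggests that the quoted value of 0.212 corresponds to evaluating the function midway between the first two nulls, at x = 3*pi/(2N), while the true maximum of the first sidelobe is slightly higher, about 0.217.

```python
import numpy as np

def pulse_train_spectrum(x, N):
    """Magnitude of sin(N x) / (N sin x), normalized to 1 at the principal lobe."""
    return np.abs(np.sin(N * x) / (N * np.sin(x)))

def first_sidelobe_midpoint(N):
    """Value midway between the first and second nulls, x = 3*pi/(2N)."""
    return pulse_train_spectrum(3 * np.pi / (2 * N), N)

def first_sidelobe_maximum(N, points=20001):
    """Numerical maximum between the first two nulls, x in (pi/N, 2*pi/N)."""
    x = np.linspace(np.pi / N, 2 * np.pi / N, points)
    return pulse_train_spectrum(x, N).max()

N = 128
midpoint_value = first_sidelobe_midpoint(N)
maximum_value = first_sidelobe_maximum(N)
print(f"midpoint: {midpoint_value:.4f}, maximum: {maximum_value:.4f}")
```

Either figure is within about 10% of the measured ratio of 0.23, so the conclusion of good agreement does not depend on which one is used.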

The maximum spectral component available in the output window is governed by the chirp
bandwidth $\alpha T$ and the maximum cell delay $z_{max}/v$, as indicated by Eq. (6). In practice, it is
much easier to change $\alpha T$ than to change the latter, because of the limitations in Bragg cell
sizes and materials.

The device used here was not optimized for maximum dynamic range, since no attempt was
made to optimize the Bragg cell deflection efficiency or the diode laser optical power.
Optimum operation would require that the maximum output signal from the detector array be
close to the saturation level so as to maximize the dynamic range utilization of the array
for a given integration time (or frequency resolution). Some measurements, however, were
carried out to determine the maximum dynamic range available under the present setup. These
dynamic range results are shown in Fig. 4. Input dynamic range could only be maintained in
the range of 20 dB for discrete Fourier transforms with 40 msec integration time.
Beyond this range, saturation occurred rapidly. The fact that a degradation in fringe
visibility, relative to CW operation, occurred whenever the diode laser was operated in
pulsed mode (biased at threshold) indicates that the coherence of the optical output has a
strong impact on the available dynamic range. Independent measurements of the spectral
content of the diode laser showed multimode emission under pulsed operation. Reported
measurements of both the spatial and temporal coherence of pulsed diode lasers(7) support
this view. We also observed that the reduction in visibility is not uniform across the
output window. The observed phenomenon occurred to various degrees for different lasers,
and suggests a strong need for single-mode control under pulse modulation. A promising
method for doing so using optical feedback injection locking has been reported.(8)

Increased speed and throughput beyond what was demonstrated is possible, subject to the
constraints given by Eq. (6). The present limit on the transform speed is the serial
read-out time of the detector array (3 msec here). The read-out time, however, can be
reduced by use of faster clock circuits for the array (a factor of 10 is easily possible
with silicon-based devices such as the Reticon used here), or by use of special parallel
readout arrays. The ultimate limit on the transforming time is the chirp duration. When
large bandwidth, short duration chirps are used, there is a corresponding requirement for
high-speed modulation of the diode laser over large dynamic ranges. Larger power output
from the diode laser and higher Bragg cell deflection efficiency are also important
considerations, since it is desirable to have the maximum output signal level close to
the saturation level of the detector array to fully use the dynamic range of the array.
Linear dynamic range can also be increased by using an electronic drive circuit having
feedback from a photodetector monitoring the laser output power. The drive circuit
maintains a linear relationship between the output power and input signal level.

Continuous  Fourier  Transforms 

Continuous Fourier transforms were also demonstrated using the arrangement in Fig. 1. In
this case the diode laser was operated CW at a half-maximum power level, and the information
signal S(t) was added to the DC drive voltage. Single-frequency tones of up to 1 kHz were
used for S(t). The results obtained with a single tone at 200 Hz and with two simultaneous
tones at 400 and 900 Hz are shown in Figs. 5 and 6, respectively. Peaks were produced at
positions proportional to frequency, and the widths of the peaks obtained were inversely
proportional to the integration time, as expected. Dynamic range measurements were
obtained, although as in the case of the DFT earlier, the processor dynamic range was not
optimized. The dynamic range results shown in Fig. 4 give a dynamic range figure of
approximately 20 dB. It is important to note that the dynamic range for the multiple-tone
case is strongly affected by any non-linear response of the diode laser to the drive signal.
Nonlinearities in the characteristic curve of the diode laser will lead to
intermodulation-product signals. For the lasers used here, intermodulation signals were in
fact the limiting factor for the dynamic range for the case of two equal-amplitude tones.
It is possible to correct nonlinearities in the characteristic curve, and hence increase
dynamic range, by using an electronic feedback network.
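
The effect of a nonlinear laser characteristic on a two-tone input can be sketched numerically. The quadratic model `laser_output` and its coefficient are hypothetical and are not measured properties of the lasers used; only the tone frequencies (400 and 900 Hz) and the 40 msec record length come from the text. A second-order term produces intermodulation products at the difference and sum frequencies, 500 Hz and 1300 Hz.

```python
import numpy as np

def laser_output(drive, a2):
    """Hypothetical laser characteristic: linear response plus a quadratic term."""
    return drive + a2 * drive ** 2

fs = 8000.0                 # sampling rate, Hz (assumed)
n_samples = 320             # 40 ms record, giving 25 Hz frequency bins
t = np.arange(n_samples) / fs
drive = np.cos(2 * np.pi * 400 * t) + np.cos(2 * np.pi * 900 * t)
freqs = np.fft.rfftfreq(n_samples, d=1 / fs)

def tone_amplitude(output, f):
    """Amplitude of the spectral component nearest to frequency f."""
    spectrum = np.abs(np.fft.rfft(output)) * 2.0 / n_samples
    return spectrum[np.argmin(np.abs(freqs - f))]

linear = laser_output(drive, a2=0.0)
nonlinear = laser_output(drive, a2=0.1)

im_level = tone_amplitude(nonlinear, 500.0)      # difference-frequency product
signal_level = tone_amplitude(nonlinear, 400.0)  # one of the input tones
print(f"intermodulation level: {20 * np.log10(im_level / signal_level):.1f} dB")
```

With a 10% quadratic term the spurious products sit 20 dB below the tones, which illustrates how a modest curvature in the characteristic can set the usable two-tone dynamic range.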
Compact Processors


Compaction of the Fourier transform architecture used here is desirable not only to
achieve practical package sizes but also to minimize the lengths of non-common paths in this
interferometric arrangement. The Mach-Zehnder shown in Fig. 1 occupied a laboratory bench
area of about 50 cm by 25 cm. Compaction can obviously be obtained with shorter
focal-length lenses and smaller optical components and mounts. Optical component
miniaturization is ultimately limited by the size of the Bragg cells required to obtain the
desired frequency-band coverage. However, the TI architecture here offers further compaction
capability because a variety of arrangements can be used to implement the basic algorithm
(Eq. (2)). Several arrangements have been devised and demonstrated.

(i)  One  notes  that  an  additive  architecture  was  selected  and  implemented  on  a 
Mach-Zehnder  interferometer  as  described  earlier.  However,  the  Mach-Zehnder 
can  be  folded  back  upon  itself  so  that  the  original  beam  splitter  also 
serves  as  the  beam  combiner  for  the  two  diffracted  light  beams.  This  scheme 
is  illustrated  in  Fig.   7,  and  is  essentially  a  Twyman-Green  interferometer. 

(ii) The Twyman-Green arrangement still requires two separate Bragg cells, and
the diffracted beams are not on a common path until the beam combiner is
reached. By using bulk optic-surface acoustic wave (SAW) techniques
developed recently(9) one can provide an entirely common path for the
diffracted beams, while also having the two Bragg deflectors (SAW delay
lines) on a common substrate. This scheme is illustrated in Fig. 8. Two
SAW transducers are laid on opposite ends of a single substrate of LiNbO3.
Because of the anisotropic properties of Y-Z cut LiNbO3, by tilting the
transducers relative to each other by the proper amount, two incident light
beams can be made to interact independently, one with each SAW signal,
without crosstalk between the two signals.(9) Therefore, one needs to
generate the two incident light beams at the correct angles and in a compact
system. Fig. 8 illustrates one way in which the two light beams were
generated here. A bi-mirror was used along with a single light source. A common
folded path was used to simultaneously produce the angular separation of the
two beams, the lateral beam expansion, and the vertical beam focussing. An
alternative method that can be used to generate the two angled beams in a
small volume is to use a beam-splitting prism such as that used for laser doppler
velocimeters.(10) However, the beam shaping optics then cannot occupy the
same volume. On the exit side of the SAW line a single spherical lens
re-collimates the light in the vertical direction and simultaneously acts as
the first lens in a Schlieren system. Following a slit and the second
Schlieren lens, the light is deflected vertically into the photodetector
array for further compaction. DFT results have been obtained with this
acousto-optic SAW processor. Results are identical to those described
earlier, except for one further advantage. Because of the slower SAW
velocity (~3.5x10^5 cm/sec versus ~6.5x10^5 cm/sec for bulk LiNbO3) and the
low acoustic diffraction and attenuation (making longer acoustic lines
possible) the observable interval of frequency space is larger (see Eq.
(6)).

(iii) In the description of the basic TI processor above, it was noted that the
same rf chirp waveform is used to drive both Bragg cells. The two
differing waveforms required in Eq. (2) result from wave travel in opposite
directions, since it is the spatial waveform that is of importance. By
recognizing this spatial dependence it is possible to eliminate one of the
Bragg cells and have two light beams impinge contra-directionally onto the
remaining cell in a ring configuration.(11) This configuration is shown in
Fig. 9. The paths for both light beams are of exactly equal length and both
encounter virtually identical perturbations. DFT results were obtained with
this configuration. Stability against environmental perturbations was
excellent. Phase reversals of one light beam can be used to eliminate
fixed-pattern noise via frame-to-frame subtraction, as described earlier.
These phase reversals were achieved here by inserting a microscope slide
into one beam at a point where the two beams are separated. Phase reversal
can also be performed by inserting an electro-optic modulator into the ring
and introducing a 90° phase shift on each beam.
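
The frame-to-frame subtraction used to remove bias and fixed-pattern noise can be sketched as follows. This is a simplified model under stated assumptions: the bias, fixed-pattern and cross-term profiles below are synthetic, and the phase reversal is modelled simply as a sign change of the cross term between two otherwise identical frames. Only the 1024-element array size is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = 1024                        # detector elements, as for the Reticon array
z = np.arange(pixels)

# Synthetic (assumed) detector contributions
cross_term = 0.2 * np.cos(2 * np.pi * z / 16) * np.exp(-((z - 512) / 200.0) ** 2)
bias = 1.0 + 0.1 * z / pixels        # signal-dependent bias from the two beams
fixed_pattern = 0.05 * rng.standard_normal(pixels)  # pixel-to-pixel offsets

def detector_frame(sign):
    """One integrated frame; sign = +1 or -1 models the phase reversal."""
    return bias + fixed_pattern + sign * cross_term

frame_a = detector_frame(+1)
frame_b = detector_frame(-1)

# Bias and fixed-pattern terms are common to both frames and cancel
recovered = 0.5 * (frame_a - frame_b)
print(np.allclose(recovered, cross_term))
```

The cancellation relies on the bias and fixed pattern being the same in both frames, so in practice the input signal and illumination would need to be stable over the two integration periods.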

Integrated optics must also be considered as a candidate approach for compact devices.
The TI bulk optics-SAW processor above could be implemented on a structure very similar to
the integrated-optic rf spectrum analyzer demonstrated recently.(12) A method for producing
two coherent, angled light beams within the waveguide layer is needed, and a method for
rejecting undiffracted light (e.g., polarization rejection) may need to be perfected
if a Schlieren system is impractical due to substrate size requirements. One possible
disadvantage of integrated optics is that it is limited to planar geometries. Integrated
optics may in fact be best suited to the SI DFT architecture.

As mentioned earlier, the SI DFT processor requires N information-carrying light beams
for an N-point transform. A major problem of this scheme is how to produce the required
beams if N is large. N=8 has been demonstrated using bulk optics,(6) but extrapolation
beyond N=64 appears to be difficult. A scheme for obtaining a large number of independent
intensity-modulated beams using integrated optics is shown in Fig. 10. The light beams are
produced as the outputs of integrated-optical Mach-Zehnder modulators. Such devices have
been fabricated using single-mode optical waveguides.(13) These devices had 3-um-wide
guides, branching angles of 2° at the Y's, electrode lengths of several mm and a spacing of
25 um between the arms. It is possible to achieve as much as 30 dB intensity modulation
using drive voltages of no more than 10 V with these devices.(14) It is also possible to
build a waveguide branching network, as shown in Fig. 10, where several modulators utilize
the same CW laser diode source. Since the laser diode sources must be butt coupled to the
edge of the waveguide layer, it is desirable to minimize the number of laser diodes. It is
also for this reason that direct laser diode modulation is unattractive. With modulators
built on LiTaO3 or LiNbO3 substrates, it is straightforward to incorporate SAW transducers.
Contratravelling SAW chirp waveforms would interact with light in channel waveguides coming
off the output Y of the modulators. With achievable SAW delay line lengths, it should be
possible to have N in excess of 100.

Multichannel  Processing 

The characteristics of the time-integrating Fourier-transform processor described above
lend themselves easily to multichannel operation for large throughput. Since the chirp
waveforms are independent of the signal (although synchronized with it), multichannel
operation does not require any replication of the acoustic cells. A multiplicity of input
channels can be obtained using any of a variety of techniques. Fig. 11 illustrates one
technique using an array of laser diodes and a corresponding array of lenslets. Detection
is then done with a two-dimensional time-integrating photodetector array, each row of the
array corresponding to a different processing channel. Alternatively, it is possible to use
a parallel-addressed one-dimensional spatial light modulator to provide multichannel
modulation of a single, wide laser beam. Candidate modulators include total internal
reflection electro-optic devices that require only 5 V for full contrast and which can be
addressed using microelectronic structures.(15) Use of microelectronic structures has been
shown to provide a potential capability of more than 5000 independent channels on a single
crystal-microchip assembly.

Fig. 1. Experimental arrangement for time-integrating, acousto-optic DFT processor.

Fig. 2. Results obtained with the DFT processor on an equi-spaced, uniform amplitude pulse
train consisting of (a) 4 pulses and (b) 8 pulses with the corresponding FFT results (c) and
(d), respectively.

Fig. 3. Result obtained with the DFT processor on an equi-spaced, uniform amplitude pulse
train consisting of 64 (left) and 128 (right) pulses.

Fig. 4. Dynamic range results obtained with TI Fourier transform processor.

Fig. 5. Continuous Fourier transform result for single tone at 200 Hz input into TI processor.

Fig. 6. Continuous Fourier transform result for two simultaneous tones at 400 Hz and 900 Hz
input into TI processor.

Fig. 7. Fourier transform processor using Twyman-Green arrangement and folded optical path.

Fig. 8. Bulk optics - SAW Fourier transform processor.

Fig. 9. Fourier transform processor using only one Bragg cell in a ring configuration.

Fig. 10. Space-integrating DFT processor using integrated optic-waveguide modulators,
branching networks and SAW delay line.

Fig. 11. Multichannel TI DFT processor using laser diode array.

Robust adaptive thresholder for document scanning applications

T. R. Hsing

Abstract

In document scanning applications, thresholding is used to obtain binary data from a scanner. However,
due to (1) a wide range of different color backgrounds, (2) density variations of printed text information,
and (3) the shading effect caused by the optical systems, the use of adaptive thresholding to enhance the
useful information is highly desirable.

This paper describes a new robust adaptive thresholder for obtaining valid binary images. It is basically
a memory type algorithm which can dynamically update the black and white reference levels to optimize a local
adaptive threshold function. High image quality can be obtained with this algorithm from different types of
simulated test patterns.

The software algorithm is described and experimental results are presented to illustrate the procedures.
Results also show that the techniques described here can be used for real-time signal processing in
varied applications.

Introduction

The use of thresholding to enhance digital images has been studied extensively for a long time.1
Different techniques have been proposed for automatic threshold selection.2 Most of them were developed for
enhancing scene pictures and high contrast images. However, due to the complex nature of the documents
encountered today, text information with varied density printed on different color backgrounds is quite common.
Also, the MTF degradation of the imaging system and the illuminator will contribute a significant amount of
image degradation. A fixed threshold and some previously developed adaptive thresholds are not adequate for
this application.

This paper will describe a new memory type algorithm which can emphasize the local property and provide
more adaptive capability for document processing applications.

Test Target

In order to evaluate the performance of the proposed adaptive thresholder algorithm quantitatively, a
simulated density variation test target, a shading variation test target and a color background variation
test target, each having 8 bits/pixel (0 to 255 gray levels), were used. Figure 1 shows the density variation
target. There are five regions, each having the same text information but a different density. The density
ranges from 0.15 to 0.6. A typical single-scan output of this target is also shown in Figure 1; it shows noise
of about 20 out of 256 gray levels, which comes from the scanner's electronic system.

The $\cos^4$ loss of a lens system, together with the illuminators, creates shading problems for any imaging
system. In some cases, 40% to 50% shading can be expected. Figure 2 shows a simulated shading variation test
target with one reflectivity profile output. There is about a 50% reduction in intensity at both ends.
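
For a sense of scale, the $\cos^4$ law alone gives the field angle at which illumination falls to a given fraction of its on-axis value. The sketch below is illustrative only: the function names are assumptions, and it ignores the illuminator contribution and any vignetting, both of which would change the real shading profile.

```python
import math

def cos4_relative_illumination(field_angle_deg):
    """Relative illumination at a given field angle under the cos^4 law."""
    return math.cos(math.radians(field_angle_deg)) ** 4

def field_angle_for_falloff(relative_illumination):
    """Field angle (degrees) at which the cos^4 law gives the requested level."""
    return math.degrees(math.acos(relative_illumination ** 0.25))

half_angle = field_angle_for_falloff(0.5)
print(f"50% shading is reached at about {half_angle:.1f} degrees off axis")
```

Under this idealized law, a scanner lens would need a half field of view of roughly 33 degrees to produce the 50% end-of-line reduction simulated in Figure 2.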

For practical applications, a document with text information on different color backgrounds can be
expected. Although the reflectivity of the text is the same, the different color backgrounds will change the
contrast. A simulated text with different color backgrounds and a complete scan line is shown in Figure 3.

Algorithm  Description 

This is a memory type algorithm with a white reference memory, Pw, and a black reference memory, PB.
It is an immediate past white/black peak detector with a reference transition amount Kw in the white region
and a reference transition amount KB in the black region.


*When  this  work  was  done  the  author  was  with  Xerox  Corporation,  Webster,  NY,  USA. 

Figure 1. (a) The text density variation test target with five regions of different densities,
and (b) reflectivity profile of a typical scan line through the target.

Figure 2. (a) The shading effect test target with 50% reduction in light intensity at both
ends, and (b) reflectivity profile of a line through the target.

Figure 3. (a) The text information on different color backgrounds, and (b) the reflectivity profile of a typical
scan line through the target.

As shown in Figure 4, the Pw and PB values are set to the predefined values Pw0 and PB0 at the beginning
of each scan and are then updated during the scanning cycle. In this algorithm, a white discrimination
signal is generated when the instantaneous pixel value Pj of the video signal at any time exceeds the
immediately past black reference memory value PB by an amount Kw. The transition from the white to the black
region will happen only when the video signal Pj drops so that it no longer exceeds the immediately past black
reference memory value PB by Kw. Similarly, a black discrimination signal is generated when the immediately
past white reference memory value Pw exceeds the instantaneous pixel video signal Pj by an amount KB. Otherwise,
the transition from the black to the white region will occur.

Figure 4. The flow diagram of the adaptive thresholder.


Each time a transition occurs, say, from the black (or white) to the white (or black) region, the Pw (or
PB) reference is reset to the initially predefined reference value Pw0 (or PB0), and then both
Pw and PB are updated simultaneously. This algorithm can change the threshold value dynamically. It emphasizes
the local property and thereby enhances its adaptive capability. Since most documents have white borders,
the threshold calculation in the white region is adopted as the initial decision.
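
The decision rules described above can be sketched in a few lines. This is a reading of the text under stated assumptions, not the author's implementation: the exact update rule is given only in the flow diagram of Figure 4, so the peak-tracking updates (maximum for Pw, minimum for PB), the order of decision and update, the function names, and the synthetic scan line are all hypothetical.

```python
def adaptive_threshold(scan_line, pw0, pb0, kw, kb):
    """Binarize one scan line; returns 1 for white pixels and 0 for black pixels."""
    pw, pb = pw0, pb0          # white and black reference memories
    in_white = True            # the white region is taken as the initial decision
    bits = []
    for p in scan_line:
        if in_white:
            # Stay white only while the pixel exceeds the black reference by Kw
            if p - pb < kw:
                in_white = False
                pb = pb0       # reset black reference on white-to-black transition
        else:
            # Stay black only while the white reference exceeds the pixel by KB
            if pw - p < kb:
                in_white = True
                pw = pw0       # reset white reference on black-to-white transition
        # Assumed peak-detector update of both memories
        pw = max(pw, p)
        pb = min(pb, p)
        bits.append(1 if in_white else 0)
    return bits

def fixed_threshold(scan_line, level=128):
    """Conventional fixed thresholding for comparison."""
    return [1 if p >= level else 0 for p in scan_line]

# Synthetic low-contrast line: bright background (220) with faint text (150)
line = [220] * 4 + [150] * 3 + [220] * 4
adaptive_bits = adaptive_threshold(line, pw0=150, pb0=150, kw=60, kb=60)
fixed_bits = fixed_threshold(line)
print(adaptive_bits)
print(fixed_bits)
```

On this synthetic line the faint text stays above the fixed level of 128 and is lost, whereas the reference memories let the sketch register the drop relative to the recent white peak. Behavior on real scanner data would depend on the actual update rule of Figure 4.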

Algorithm Parameters


The  following  values  were  used  to  compare  the  results  with  the  fixed  thresholding  techniques. 


o Fixed threshold = 128 gray level

o This proposed algorithm

for density variation target: Predefined value for Pw and PB = 150 gray level,
Kw = KB = 60 gray level

for shading target: Predefined value for Pw and PB = 60 gray level,
Kw = KB = 60 gray level

for color background target: Predefined value for Pw and PB = 20 gray level,
Kw = KB = 60 gray level.


Results  and  Discussion 


Figures 5 through 7 compare the results of the proposed adaptive thresholder with the fixed thresholding
(at 50% contrast) method. As can be seen, the medium contrast areas are hardly detected by a fixed thresholder
for both the density variation and color background targets. Also, in the 40% to 50% shading range, fixed
thresholding introduces the artifact of solid black edges. However, the results from the proposed
algorithm show a great improvement in detectability. It can detect not only the 0.15 ΔD density information,
but also the information on the different color backgrounds. The 50% shading range can also be corrected
by this algorithm without any artifact.

This algorithm can also help image segmentation for pattern recognition applications. It is compatible with
MSI or VLSI and would not cost anything more to implement in hardware.

Figure 5. The text density variation target of Figure 1 was processed by: (a) a fixed thresholder
(at 50% contrast), (b) this new adaptive thresholder.

Figure 6. The shading effect test target of Figure 2 was processed by: (a) a fixed thresholder
(at 50% contrast), (b) the new adaptive thresholder.

Figure 7. The color backgrounds target of Figure 3 was processed by: (a) a fixed thresholder (at 50%
contrast), (b) the new adaptive thresholder.

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Abstract 

This paper introduces a new nonlinear filter model which has applications in low-level machine vision.
We show that this model, which we designate the normalization filter, is the basis for non-directional,
multiple spatial frequency channel resolved detection of image edge structure. We show that the results
obtained in this procedure are in close correspondence to the zero-crossing sets of the Marr-Hildreth edge
detector.6 By comparison to their model, ours has the additional feature of constant-contrast
thresholding, viz., it is spatially brightness adaptive. We describe a highly efficient and flexible
realization of the normalization filter based on Burt's algorithm for pyramidal filtering.18 We present
illustrative experimental results that we have obtained with a computer implementation of this filter
design.

Introduction 

This paper reports a two-dimensional filtering technique which has primary application to image
understanding. We have designed the pyramidal normalization filter (PNF), as we designate our model, as an
approach to robust contrast-contour extraction from complex scene imagery. To the extent that we show the
PNF parallels certain behavior of the early human visual system (HVS), the PNF may be viewed as a visual
model.1 We informally present it as such in the context of computational theory of vision.^ In this
context, the PNF has more general applications to image data compression, image quality evaluation, and
image enhancement. Our purpose in this paper is to introduce the reader to the PNF model, argue its
plausibility, and demonstrate its application to general purpose low-level image feature extraction. We
more explicitly consider the application of the PNF to image coding and enhancement in a forthcoming
paper.3 We also refer the reader to our recent technical reports on the normalization filter and
pyramidal filtering techniques.4,5

Description  of  the  pyramidal  normalization  filter 

We show a functional diagram of the basic normalization filter (NF) model^ in Figure 1. From its
diagram and from the illustrative sketches of Figure 2, the reader should observe that the NF is in effect
a nonlinear bandpass filter. The reader who is familiar with recent work of Marr et al.^-^ at the MIT
Artificial Intelligence Laboratory on models for image edge detection will note a similarity between the NF
edge response and that expected of the Marr and Hildreth $\nabla^2 G$ model.6 We subsequently explore this
similarity in greater detail.

Computational  theory  of  the  normalization  filter 

Our proposal of the NF model recognizes the following properties of complex scene imagery, among
others:

Edge information content. As recently explicated by Binford,9 Barrow and Tenenbaum,10 Marr,11 and
others, image contrast-contour structure is rich in both photometric and geometric constraints on possible
scene interpretation. Viz., such changes in the image intensity surface correspond to projective scene
changes in illumination, surface reflectivity, and surface orientation and distance with respect to the
observer. We believe that a low-level visual representation should make contrast-contour information
explicit.

Localization of edge structure. From the above description, one notes that contrast-contour structure
is by definition a spatially localized information source, the fidelity of which is conditioned on
having an accurate mechanism of detection. Detection will likely be subject to noise arising from the
image forming process and/or endemic to the detector itself.

Variable resolution. The visual field is cast via a perspective image-forming process from the scene.
The relative scale of image events is determined by the proximity of their scene isomorphs to the viewer
and the relative size of same. The edge detection process should afford equivalent performance to these
events, i.e., within limits imposed by physical optics, noise, and computation, it should be resolution
invariant.

Variable contrast. As a consequence of imaging conditions and object reflectance properties, natural
imagery usually possesses a large dynamic range. Scene events which are of equal perceptual importance may
give rise to images of substantially different intensity and contour contrast values. Edge detection
problems arise regarding noise threshold, saturation, and contrast adaptation.

The principal author was a full-time faculty member of Brown University; he is now employed by Honeywell,
Inc., Systems and Research Center, 2600 Ridgway Parkway, MN17-2306, Minneapolis, MN 55440.

We now describe briefly how the NF model addresses these issues. The input of the NF is an image
intensity surface. We desire that the NF signal the presence of intensity surface edges; by the previous
arguments, the operational principle by which the NF does so should be noise immune, edge orientation and
contrast invariant, and local and adaptive to the scale of edge information content. The need for noise
immunity is obvious. Locality and orientation-invariance of the edge detection mechanism recognize both the
underlying definition of edge and that the mechanism of implementation is desirably simple, highly
replicated, and parallel in operation throughout the image plane. We propose that a perceptually more
meaningful goal than contrast-invariant detection is contrast-equivalent detection; a given scene imaged
under different lighting conditions, or even subject to variable lighting conditions within, gives rise to
many edges of similar contrast but considerably different absolute intensities (e.g., as in the former
case, would arise from simply increasing overall scene illuminance). Setting aside other requirements for
the moment, we seek a filter somewhat analogous to an automatic gain control mechanism: the response to
local signal changes should be rendered independent of more global signal fluctuations. On reconsideration
of Figure 2, the reader will see how the NF addresses this requirement. A linear filter point spread
function (PSF) is defined whose characteristic extent is approximately matched to the anticipated edge
extent. The sensory image is convolved with this PSF; the resultant image is registered with the original
and divided into it on a point-for-point basis, viz.


$$ I_o(x,y) = \frac{I_i(x,y)}{I_i(x,y) * H(x,y)} \tag{1} $$

(Cf. Figures 1 and 2)

where

$(x,y)$ = image intensity surface coordinates

$I_i$ = sensory image (input)

$H$ = point spread function


Observe that domains of the signal in which the intensity surface changes slowly on the spatial scale
of $H(x,y)$ yield $I_o(x,y) = 1$, viz., the signal intensity is normalized. Filter response within
(approximately) a PSF extent of an edge is characteristic of a lateral inhibition phenomenon, displaying
the anticipated Mach band contrast-contour emphasis. Given circumstances in which intensities $I_l$, $I_h$
adjacent to the edge remain approximately constant within the domain of the PSF, the NF response is
invariant for edges where the ratio $I_h : I_l$ = constant. (This argument ignores for the moment possible
edge-to-edge differences in local intensity surface structure, of course.) Companding the NF output by a
natural logarithmic transformation*, viz.

$$ I_{o(\mathrm{new})} = \ln I_{o(\mathrm{old})} = \ln\left[\frac{I_i}{I_i * H}\right] \tag{2} $$

gives rise to a new signal in which the normalized background areas have value zero and contrast-contour
locations are marked by zero-crossings (see Figure 2). Note that in order to judge NF detection accuracy
(or that of any other filter) one must have an edge model. For the purposes of this study we consider an
edge to be a domain encompassing an inflection of the image intensity surface.
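
The single-channel NF described by eqs. (1) and (2) can be sketched in a few lines. This is only an illustration under stated assumptions, not the Fortran implementation reported later: the PSF is taken to be a sampled Gaussian truncated at three standard deviations, image borders are replicated, and a small constant `eps` keeps the logarithm finite. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Unit-sum sampled Gaussian, truncated at 3 sigma (an assumption)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def smooth(image, sigma):
    """I * H: separable smoothing with edge-replicated borders."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    p = np.pad(np.asarray(image, dtype=float), r, mode="edge")
    p = np.apply_along_axis(np.convolve, 1, p, k, mode="valid")     # along rows
    return np.apply_along_axis(np.convolve, 0, p, k, mode="valid")  # along columns

def normalization_filter(image, sigma, eps=1e-6):
    """Log-companded NF output, ln[ I / (I * H) ]; eps guards against ln(0)."""
    img = np.asarray(image, dtype=float) + eps
    return np.log(img / smooth(img, sigma))

# A vertical step edge: dark (50) on the left, bright (100) on the right.
step = np.full((16, 32), 50.0)
step[:, 16:] = 100.0
out = normalization_filter(step, sigma=2.0)
```

On this test image the output is zero over the flat background, negative on the dark side of the edge, and positive on the bright side, so the sign change falls between columns 15 and 16. Multiplying the input by a constant leaves the output essentially unchanged, which is the gain-control behavior the text asks for.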

As is specific to the Marr and Hildreth edge detection model (cf. ref. 6, pg. 192), we consider the
edge location to be a peak in the first directional derivative of the smoothed image intensity surface,
equivalently a zero-crossing in the second directional derivative of this surface:

$$ D^2[H(x,y) * I_i(x,y)] = 0 \tag{3} $$

*  This and subsequent footnotes follow the main text.

Unangst and Schenker show the following in a one-dimensional analysis posed in ref. 4: given that $H$ is
symmetric and monotonic decreasing, and given that the edge is antisymmetric about its mean value at the
maximum gradient location, then the zero-crossings of $D^2[H * I]$ and the NF are identical (within limits
specific to threshold detection in each filter). This preliminary result, which we later illustrate
experimentally, is important. It implies that the NF, while functionally quite different from a now widely
accepted model of edge detection, preserves the representation of that model under a reasonably weak
constraint. Before turning to a discussion of the functional form of $H$, we make one further modification
of the NF model:

$$ I_o = \ln\left[\frac{I_i * H_1}{I_i * H_2}\right] \tag{4} $$

where

$$ H_1(|r|) = H_2(k|r|), \quad k > 1 $$

I.e., filter functions $H_1$ and $H_2$ are identical other than in scale.

Logarithmic decomposition of the image quotient in eq. (4) gives:

$$ I_o = \ln[I_i * H_1] - \ln[I_i * H_2] \tag{5} $$

When the NF process is expressed in this fashion, both its functional and physical character are manifest:
the NF effects bandpass filtering simply by differencing two companded, smoothed replicas of the sensory
image.**
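
The gain-control property sought earlier can be read directly from the differenced form. As a short worked check, suppose the sensory image is multiplied by a constant illumination factor $\alpha > 0$ (a symbol introduced here only for illustration) and that the smoothed images are strictly positive. Because convolution is linear:

```latex
\begin{aligned}
I_o' &= \ln[(\alpha I_i) * H_1] - \ln[(\alpha I_i) * H_2] \\
     &= \ln\alpha + \ln[I_i * H_1] - \ln\alpha - \ln[I_i * H_2] \\
     &= \ln[I_i * H_1] - \ln[I_i * H_2] = I_o .
\end{aligned}
```

The output, and hence its zero-crossing set, is therefore unchanged by a global change of illuminance. When $\alpha$ instead varies slowly over the extent of the wider PSF, the same cancellation holds only approximately.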

Specification of point spread function H(x,y). In view of our earlier remarks, the filter function
$H(x,y)$ must satisfy two requirements:

1)  reduction of intensity surface information content to a characteristic scale of resolution.

2)  maximal localization of edge structure at a given scale of resolution.

In the context of linear system analysis these are conflicting requirements. They argue for minimizing
filter spatial variance at given bandwidth; however, the uncertainty principle delimits the minimum
space-bandwidth product of any function. These arguments have been cited by Marr and Hildreth6 as reasons
to suppose a circularly symmetric Gaussian function for a smoothing filter:

$$ G(r) = \left(\frac{1}{2\pi\sigma^2}\right) \exp\left(-\frac{r^2}{2\sigma^2}\right) \tag{6} $$

This  filter  is  optimal  in  the  sense  that  it  produces  maximum  signal  energy  within  a  resolution  interval  of 
specified  width  in  the  vicinity  of  an  edge;2-*  the  Gaussian  is  also  plausible  in  light  of  results  obtained 
in  recent  psychophysical  studies  of  early  HVS  spatial  frequency  mechanisms  by  Wilson  and  Giese^  and  Wilson 
and  Bergen^.     Related  studies  implicate  exponential  filter  functions  or  combined  exponential  -  Gaussian 
envelopes  (Cf.  ref.  1,  pg.  553).     We  propose,   for  the  present,  use  of  the  above  Gaussian  smoothing  filter 
in  the  NF  model.     Our  choice  of  the  Gaussian  is  borne  out  by  our  empirical  findings;  because  the  NF  is 
nonlinear,  results  of  the  Marr-Hildreth  signal  theoretic  argument  summarized  above  do  not  carry  over  to  our 
edge  detector  in  a  strict  sense. 

It remains to address how the NF model can best detect edges of widely varying orientation and scale.
The former problem is subsumed by our decision to use a radially symmetric smoothing function. We have
thereby made the NF a nondirectional edge detector. Alternatively, one could specify a separable Gaussian
filter function having different characteristic spatial extent along (x,y).^ This provides a directionally
selective detector at the expense of requiring multiple convolutions at any one image location to allow for
all possible edge orientations. One in principle trades off zero-crossing location accuracy (relative to a
peak in the first directional derivative of $I_i * H_1$) against greater computational complexity. In our
initial experimental studies we have elected to use a radially symmetric smoothing function. We have
subsequently gained empirical evidence that the NF so designed behaves in accordance with findings of Marr
and Hildreth regarding their nondirectional $(\nabla^2 G) * I_i$ edge detector model: given that "intensity
variation in $(G * I_i)$ is linear along but not necessarily near to a line of zero-crossings, then the
zero-crossings will be detected and accurately located by zero values of the Laplacian (read, 'radially
symmetric NF')".6


We have considered two approaches to modeling for optimal detector performance on variable scale edge
structure. In the first approach we would propose that the smoothing filter be locally tunable to
parameters of the nonstationary image. This approach presupposes a system theoretic attitude, conceivably
resulting in an analytic or numerical two-dimensional adaptive filter design. A particular problem in this
approach is that it is as yet poorly understood how to model image data generation in any generality (Cf.
ref. 16). Thus, robust system identification models, let alone their efficient realization, are lacking
for application to the domain which our paper assumes. Alternatively, and more consistent with our
philosophy of a computational theory of the edge detection process, one may model the NF not as a single
filter, but as a number of structurally identical filters acting at various characteristic bandwidths.
E.g., one envisions a filter bank whose channels are described by a slight modification of eqs. (4) and (5):

$$ I_{o_\ell} = \ln\left[\frac{I_i * G_\ell}{I_i * G_{\ell+1}}\right], \quad \ell = 0, 1, 2, 3, \ldots \tag{7} $$

$$ \phantom{I_{o_\ell}} = \ln[I_i * G_\ell] - \ln[I_i * G_{\ell+1}]. \tag{8} $$

In this model, the $G_\ell$ are a set of Gaussian smoothing functions

$$ G_\ell \sim \exp\left(-\frac{r^2}{2\sigma_\ell^2}\right) \tag{9} $$

wherein $\sigma_\ell$ may or may not increment by a characteristic interval. One simple example, for which
we later offer experimental illustrations, is the case where $\sigma_\ell$ progresses geometrically:

$$ \sigma_{\ell+1} = C\sigma_\ell, \quad \text{where constant } C > 1 \tag{10} $$

E.g., consider the instance $C = \sqrt{2}$: the NF in effect becomes a set of log-companded contiguous
octave spatial frequency channels. One can envision the output of such a filter as a vertical column of
spatially registered images $\{I_{o_\ell}\}$ which progress from the top downward coarse-to-fine
($\ell = 0, 1, 2, \ldots$) in their resolved detail.
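
A multi-channel bank of this kind can be sketched for a one-dimensional signal as follows. This is an illustration under assumptions rather than the authors' implementation: the Gaussians are sampled and truncated at three standard deviations, borders are replicated, the input is strictly positive, and the defaults `sigma0=1.0` and `C=2.0` are illustrative values only (any `C > 1` may be substituted). The function names are hypothetical.

```python
import numpy as np

def smooth_1d(signal, sigma):
    """G_l * I for a 1-D signal: unit-sum Gaussian truncated at 3 sigma."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    padded = np.pad(np.asarray(signal, dtype=float), radius, mode="edge")
    return np.convolve(padded, k, mode="valid")

def nf_filter_bank(signal, sigma0=1.0, C=2.0, n_channels=3):
    """Channel l is ln[I * G_l] - ln[I * G_{l+1}], with sigma_{l+1} = C * sigma_l."""
    sigmas = sigma0 * C ** np.arange(n_channels + 1)
    logs = [np.log(smooth_1d(signal, s)) for s in sigmas]
    return [logs[l] - logs[l + 1] for l in range(n_channels)]

# Step edge between samples 63 and 64 of a strictly positive signal.
signal = np.where(np.arange(128) < 64, 50.0, 100.0)
channels = nf_filter_bank(signal)
```

Every channel changes sign between samples 63 and 64, so the same edge is reported at the same location at each scale, while the width of the response on either side grows with the channel index.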

Corresponding to each image in the column are zero-crossings to be detected in the usual fashion.
These zero-crossings are of themselves not necessarily direct correlates of distinct events in the scene.
Indeed, the zero-crossings are both redundant and ambiguous in their information content: the structure of
a single edge is dispersed through multiple frequency channels, and edge structure of distinct scene events
may be masked by spatial averaging effects of a single frequency channel. What is needed is some guidance
as to how to reduce and parse the entire zero-crossing set to a primitive representation that is directly
correlated to the scene. While we do not pursue this issue experimentally, we do comment briefly on some
related findings of Marr and Hildreth.6 They have observed that it is basically one's notion of physical
constraints on the structure of the visual world that generates restrictions on the manner in which the
zero-crossing set is to be combined. Specifically, the guidance one requires to do so comes from the
aforementioned constraint of spatial localization. Marr and Hildreth embody this constraint in their
spatial coincidence assumption: "If a zero-crossing segment is present in a set of independent
($\nabla^2 G$) channels over a contiguous range of sizes and the segment has the same position and
orientation in each channel, then the set of zero-crossing segments may be taken to indicate the presence
of an intensity change in the image that is due to a single physical phenomenon (a change in reflectance,
illumination, depth, or surface orientation)."
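
The positional half of the spatial coincidence assumption can be illustrated with a small sketch. It is not part of the reported experiments, and it makes several assumptions: each channel's zero-crossings are supplied as a boolean map on a common grid, agreement is accepted within a hypothetical `tolerance` in pixels, and orientation agreement is not tested at all.

```python
import numpy as np

def coincident_crossings(crossing_maps, tolerance=1):
    """Keep locations where every channel has a zero-crossing within `tolerance` pixels.

    crossing_maps: list of equally shaped 2-D boolean arrays, one per channel.
    Only position is compared; orientation agreement is not checked here.
    """
    combined = np.ones(crossing_maps[0].shape, dtype=bool)
    for zc in crossing_maps:
        rows, cols = zc.shape
        padded = np.pad(zc, tolerance, mode="constant", constant_values=False)
        grown = np.zeros(zc.shape, dtype=bool)
        # Logical OR over a (2*tolerance+1)^2 neighbourhood (binary dilation).
        for dy in range(2 * tolerance + 1):
            for dx in range(2 * tolerance + 1):
                grown |= padded[dy:dy + rows, dx:dx + cols]
        combined &= grown
    return combined

# Two channels reporting the same vertical edge, displaced by one pixel,
# plus an isolated crossing that appears in the fine channel only.
fine = np.zeros((5, 7), dtype=bool)
fine[:, 3] = True
fine[2, 0] = True
coarse = np.zeros((5, 7), dtype=bool)
coarse[:, 4] = True
both = coincident_crossings([fine, coarse])
```

The isolated fine-channel crossing is rejected because no other channel supports it, while the displaced edge survives as a two-pixel band of agreement (columns 3 and 4). The result is a region of agreement rather than a thinned contour.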

Implementation  of  the  pyramidal  normalization  filter 

We have implemented the multi-channel normalization filter on a system consisting of a DEC 11/23
computer, Grinnell 512 x 512 pixel display controller, high resolution video display, and disc and magnetic
tape storage devices. The system is programmed in Fortran; many of the basic display operations are
performed as Fortran-callable machine code macros.

Our primary design goals for the NF implementation were its flexible parameterization and computational
speed, so as to make the implementation a convenient experimental testbed. The major obstacle to
achieving speedy execution of the NF operations is the computational requirement of the multi-channel
smoothing filters. We have addressed this problem by implementing the smoothing operations via Burt's
algorithm.18 This algorithm, as illustrated in Figure 3, has a pyramid computational structure which
allows  one  to  perform  the  Gf*I^  computation  in  time  proportional  to  0  [log2  t£^n   (ao/0i)*^]  vs-  °  [CT£  ]""  a£ 
is  the  effective   window  size  of  smoothing  filter  Gl .     As  Figure  3  shows,  Burt's  algorithm  recursively 
generates discrete approximations to $G_\ell * I_i$, with the convolution kernel best approximating a
sampled Gaussian in the limit of $\ell$ large. (Implementation of $G_\ell * I_i$ is also approximate in the
sense of radial symmetry: the sampled representation of the pyramid data structure is Cartesian as opposed
to polar, thus impeding realization of a radially symmetric filtering operation.) The reader should note
that the Burt algorithm allows filtering operations other than Gaussian by a simple change of the spatial
weighting coefficients. Also, one can realize novel filters by differencing of pyramid levels. E.g., one
can subtract $G_{\ell+1} * I_i$ from $G_\ell * I_i$ to realize a "difference of Gaussian (DOG)" bandpass
filtering operation. The DOG operation is a discrete approximation to $\nabla^2 G_\ell * I_i$ and thus is
useful in implementing the Marr-Hildreth edge detector discussed earlier. We refer to the multi-channel NF
implemented by the Burt algorithm as the pyramidal normalization filter (PNF).
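
The pyramid recursion and the level differencing described above can be sketched as follows. This is a minimal illustration, not the Fortran code used in the experiments. It assumes a separable 5-tap generating kernel (c, b, a, b, c) built from the weights quoted in the Figure 4 caption (a = .4, b = .25, c = .05), reflected borders, image dimensions divisible by two at every level, and a small `eps` inside the logarithms. The function names are hypothetical.

```python
import numpy as np

# Generating kernel (c, b, a, b, c) with the weights quoted for Figure 4.
W = np.array([0.05, 0.25, 0.40, 0.25, 0.05])

def _separable(image, kernel):
    """Filter rows then columns; borders are reflected (an assumption)."""
    r = len(kernel) // 2
    p = np.pad(image, r, mode="reflect")
    p = np.apply_along_axis(np.convolve, 1, p, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, p, kernel, mode="valid")

def reduce_level(image):
    """One reduction step: smooth with W, then keep every second sample."""
    return _separable(image, W)[::2, ::2]

def expand_level(image, shape):
    """EXPAND: zero-stuff to `shape`, then interpolate with 2*W per axis."""
    up = np.zeros(shape)
    up[::2, ::2] = image
    return _separable(up, 2.0 * W)

def build_pyramid(image, n_levels):
    """Levels 0..n_levels; level 0 is the input itself."""
    levels = [np.asarray(image, dtype=float)]
    for _ in range(n_levels):
        levels.append(reduce_level(levels[-1]))
    return levels

def dog_channel(levels, l):
    """Difference of adjacent levels on the sampling grid of level l."""
    return levels[l] - expand_level(levels[l + 1], levels[l].shape)

def pnf_channel(levels, l, eps=1e-6):
    """Log-ratio counterpart of dog_channel (difference of companded levels)."""
    coarse = expand_level(levels[l + 1], levels[l].shape)
    return np.log(levels[l] + eps) - np.log(coarse + eps)

# Vertical step edge between columns 31 and 32.
image = np.full((64, 64), 50.0)
image[:, 32:] = 100.0
levels = build_pyramid(image, 3)
dog = dog_channel(levels, 0)
pnf = pnf_channel(levels, 0)
```

Both channels are zero over the flat background and change sign between columns 31 and 32, so in this simple case the difference and log-ratio forms place the zero-crossing at the same location. Because each level has a quarter of the samples of the one below, building the whole pyramid costs only a small constant multiple of one 5-tap pass over the original image.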

Experimental  Results 

We  now  experimentally  illustrate  some  of  the  ideas  set  forth  in  earlier  sections  of  the  paper.  Our 
purpose  in  showing  these  results  is  demonstrative  as  opposed  to  quantitative  -  we  do  not  claim  that  the 
results  we  present  are  optimal. 

The results of Figure 4 are derived from an image taken from the General Motors Corp. "bin-of-parts"
data base.19 We depict the original image, its first four levels ($\ell$ = 0-3) of representation in the
Burt pyramid, a difference image of pyramid levels $\ell$ = (0,1), and the DOG zero-crossings obtained for
this difference image.

In Figure 5, we depict the zero-crossings obtained for both the DOG and NF detectors operating at
image pyramid levels $\ell$ = (1,2), viz., the zero-crossings of level $\ell$ = 1. The relative
correspondence of the DOG and NF zero-crossing results is influenced by the detection thresholds one
chooses for the two filter models; we chose our thresholds to subjectively optimize each model's
performance. The reader should observe that the two filters extract a large amount of edge detail relative
to the characteristic extent of their underlying detection windows.
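
The text does not define how the detection thresholds were applied. Purely as an illustration, the sketch below marks a zero-crossing wherever a bandpass output changes sign between horizontally or vertically adjacent pixels, and keeps it only if the jump across the crossing exceeds a threshold. That reading of "threshold", the 4-neighbour test, and the function name are all assumptions.

```python
import numpy as np

def zero_crossings(channel, threshold=0.0):
    """Mark pixels whose right or lower neighbour has the opposite sign.

    A crossing is kept only if the jump across it exceeds `threshold`
    (one possible reading of a detection threshold; an assumption).
    """
    c = np.asarray(channel, dtype=float)
    zc = np.zeros(c.shape, dtype=bool)
    horiz = (c[:, :-1] * c[:, 1:] < 0) & (np.abs(c[:, 1:] - c[:, :-1]) > threshold)
    vert = (c[:-1, :] * c[1:, :] < 0) & (np.abs(c[1:, :] - c[:-1, :]) > threshold)
    zc[:, :-1] |= horiz   # crossing recorded at the left pixel of the pair
    zc[:-1, :] |= vert    # crossing recorded at the upper pixel of the pair
    return zc

# Toy bandpass output: negative on the left, positive on the right.
channel = np.tile(np.array([-0.2, -0.1, 0.3, 0.2]), (3, 1))
edges = zero_crossings(channel)
strong = zero_crossings(channel, threshold=0.5)
```

With no threshold the sign change between columns 1 and 2 is marked in every row. Raising the threshold above the size of the jump (0.4 here) suppresses it, which is how a threshold choice can alter the apparent correspondence between two detectors.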


102  /  SPIE  Vol.  341  Real  Time  Signal  Processing  V(1982) 


In Figure 6 we demonstrate that one may use either the DOG or NF models to extract image information
other than the zero-crossings. What we show in Figure 6 is a class of image features that we designate the
max-min array.3,5 This form of image representation is intended to make explicit image locations at which
there is a maximum change of gradient in the image intensity surface. Like the image zero-crossings, these
features may be extracted at characteristic scales of resolution. The example that we depict here is
derived from DOG operation on levels $\ell$ = (2,3) of the Burt pyramid. For reasons that we subsequently
make clear in Figure 6, we show the max-min array as loci labeled with the underlying image intensity
surface values, rather than as a binary image (something that one can of course also do with the
zero-crossings). For purposes of comparison we show the corresponding binary DOG zero-crossings. In ref. 3,
Schenker and Knaak argue that the set of multi-channel max-min arrays is the basis for both mathematically
complete and perceptually meaningful image representation. They report some preliminary experiments in
which image reconstruction from the multi-channel max-min set is demonstrated. We offer one such example in
Figure 7 for the reader's understanding. While the results shown therein were obtained with the DOG
detector, the NF detector is equally applicable to this end.
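
The precise construction of the max-min array is given in refs. 3 and 5 and is not reproduced here. The sketch below is therefore only one plausible reading, stated as an assumption: a max-min locus is taken to be a pixel at which the bandpass (DOG or NF) output is a strict local maximum or minimum along its row or column, and each locus is labeled with the underlying image intensity. The threshold and the function name are hypothetical.

```python
import numpy as np

def max_min_array(channel, intensity, threshold=0.0):
    """Label strict local extrema of a bandpass output with the image intensity.

    Assumed reading of the max-min array, not the definition of refs. 3 and 5.
    Non-loci are returned as 0.
    """
    c = np.asarray(channel, dtype=float)
    loci = np.zeros(c.shape, dtype=bool)
    # Strict local extrema along rows (interior columns only).
    mid, left, right = c[:, 1:-1], c[:, :-2], c[:, 2:]
    loci[:, 1:-1] |= ((mid > left) & (mid > right)) | ((mid < left) & (mid < right))
    # Strict local extrema along columns (interior rows only).
    mid, up, down = c[1:-1, :], c[:-2, :], c[2:, :]
    loci[1:-1, :] |= ((mid > up) & (mid > down)) | ((mid < up) & (mid < down))
    loci &= np.abs(c) > threshold
    return np.where(loci, np.asarray(intensity, dtype=float), 0.0)

# Toy bandpass response to a step edge, with the underlying intensities.
channel = np.tile(np.array([0.0, -1.0, -3.0, 3.0, 1.0, 0.0]), (4, 1))
intensity = np.tile(np.array([50.0, 50.0, 50.0, 100.0, 100.0, 100.0]), (4, 1))
mm = max_min_array(channel, intensity)
```

The two loci bracket the zero-crossing, one on each side of the edge, and carry the dark and bright intensities respectively. That pairing of position with intensity is what makes interpolation between loci, and hence reconstruction, conceivable.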

Summary  and  conclusions 

The primary themes that we have developed in this paper are: the information one should seek to
extract from a static sensory image, and the manner in which this is best done. We examined the projective
optical transformation which takes scene into image; from a physical understanding of this image-forming
process we set forth the following premises:

1)  Loci  in  the  image  intensity  surface  where  the  surface  gradient  (or  its  spatial  derivative)  is  locally 
maximum  are  strong  perceptual  correlates  to  scene  structure. 

2)  Such  changes  in  the  image  intensity  surface  occur  at  a  priori  unpredictable  spatial  orientations. 

3)  Such  changes  in  the  image  intensity  surface  occur  over  widely  varying  spatial  scales. 

4)  The  perceptual  significance  of  such  changes  in  the  image  intensity  surface  is  based  not  in  their 
absolute  values,  but  rather  in  their  values  relative  to  that  of  the  image  intensity  surface  in  their 
spatial  locale. 

These premises motivated our proposal of the multi-channel normalization filter, a computational model
for hierarchically resolved, nondirectional, equi-contrast threshold detection of the loci described in 1).
Our theory of image feature extraction is based in premises quite similar to those underlying the Marr and
Hildreth theory of edge detection.6 Our theory is distinct from theirs first for its emphasis on
brightness adaptation, and second, for its observation of the potential importance of perceptual
information encoded in the max-min set as well as the zero-crossing set. We have addressed the first issue
in this paper with our proposal of the normalization filter mechanism; we explore the second issue in
refs. 3 and 5. We offered a simple illustration of normalization filter zero-crossing detection via an
approximate NF implementation based on use of Burt's algorithm18 for pyramidal filtering.

We believe that the philosophy evidenced in our proposal of the normalization filter is significant.
We examined the physical transformation that takes scene into image. This inquiry led us to conclude that
within the image there exists a symbol set which is a direct perceptual correlate of scene structure.20
This symbol set is the information that is presumably input to higher level visual processes of spatial
organization; e.g., shape recovery from gradients of texture, luminance, stereo disparity, and optical flow
(assuming motion of the observer and/or scene). Beyond the work cited earlier,9-11 *** there has been only
modest effort on the part of the vision community to understand the physical constraints on the visual
world embedded in image edge structure. Such understanding can come only from a concentrated effort to
model the image-forming process of real domains, experimental validation of such models, and comprehensive
psychological investigation as to these models' inference to the organization of higher level processes of
perception. These pursuits are important not only for our understanding of how we see; they also promise
to impact on our approaches to solving problems in image data compression, image quality assessment, and
image enhancement. Our ability to optimally transmit, prepare, and restore image data awaits our
comprehension of exactly what information within the image is necessary for stable perception of the
physical world which the image encodes.

Footnotes 

*    Readers familiar with modeling of the early HVS will note that the logarithmic nonlinearity of the NF
     model appears at an unconventional location in the signal flow relative to low-pass filtering. By
     contrast, however, see the work of Hall and Hall^ and the remarks of Granrath (ref. 1, pg. 554) on
     this issue.

**   Some readers may note that the NF model seen in this context is log-equivalent to an unsharp masking
     operation.13 This is reflective of the NF model's applicability to image enhancement, a topic we
     briefly explore in our conclusion.

***  Cf. also Marr and Ullman21 and Clocksin22 regarding detection and interpretation of dynamic edges.

(Figure 1): Functional diagram of the normalization filter.

(Figure 2): Illustration of normalization filter operations.


(Figure 3): Burt's pyramidal filtering algorithm (after ref. 18).

(Left). Recursive discrete spatial averaging of the intensity array. Each level of recursion halves
image bandwidth and allows effective Nyquist rate reduction by a factor of two. The image of any level
can be interpolated to the original sample rate by an analogous recursive downward EXPAND operation.
(Cf. ref. 18)

(Right). Level-by-level evolution of the effective convolution kernel for the Burt filtering pyramid.
The kernel approaches a Gaussian envelope in the limit of large $\ell$.


(Figure 4): An example of multiple spatial frequency channel zero-crossing extraction.
(Clockwise from upper left)

Image from General Motors Corporation data base.19

Levels ($\ell$ = 0-3) of the Gaussian filtered image pyramid. Spatial weighting coefficients of the
Burt pyramidal filter are (a = .4, b = .25, c = .05). Cf. Figure 3.

Difference image obtained by subtraction of level 1 from level 0 in Gaussian filtered image
pyramid.

Zero-crossings detected in difference image.


(Figure 5): A comparative example of zero-crossing detection by the Marr-Hildreth DOG filter**
and the pyramidal normalization filter.
(By columns from left)

DOG filter zero-crossings for level 1 of Gaussian filtered image pyramid.
(Reading clockwise from upper left)

     DOG zero-crossings registered on level 1 image
     Level 1 image
     Level 2 image
     Original image

PNF zero-crossings for level 1 of Gaussian filtered image pyramid.
(Reading clockwise from upper left)

     PNF zero-crossings registered on ln-companded level 1 image
     ln-companded level 1 image
     ln-companded level 2 image
     Original image


(Figure 6): A comparative example of zero-crossing and max-min detection by the DOG filter.
(From left)

Zero-crossings detected for level 2 of Gaussian filtered image pyramid.

Intensity array of max-min loci detected for level 2 of Gaussian filtered image pyramid.

(Figure 7): An example of image reconstruction from a max-min set representation.
(Clockwise from upper left)

Registered max-min set for levels (0-3) of the Gaussian filtered image pyramid (high threshold).

Same (low threshold).

Image reconstruction obtained by superposition of the level-by-level, two-directional, maximum
gradient interpolations of max-min arrays (Cf. ref. 3). The max-min set used in the
reconstruction was obtained at channel thresholds intermediate to the above examples; effective
data compression is 3.5 (approx.).

Original


Differential based gradient contour segmentation algorithm


John  F.  Gilmore 

Signal  Processing  Department,  Martin  Marietta  Orlando  Aerospace 
P.O.  Box  5847,  MP  304,  Orlando,  Florida  32855 


Abstract 

A  new  technique  of  image  segmentation  using  pixel  differences  is  discussed.     Current  IR 
imaging  seekers  tend  to  be  noisy  and  lead  to  noise-generated  clutter.     Due  to  its  design, 
the  magnitude  contrast  segmenter  reduces  the  amount  of  clutter  generated  during  segmentation 
while  consistently  generating  objects  of  interest. 

The basic steps in the algorithm are magnitude difference, contrast evaluation and edge
degapping. The edges generated form a closed boundary without using the iterative processing
required by other segmenters. The algorithm also segments the high and low intensity
areas of an object into one region and identifies the internal structure separating each.
Intermediate results are presented in order to document each step in the algorithm. The
final result is a clutter reduced, segmented image of well defined regions. A diverse set
of images is presented to demonstrate the effectiveness of this algorithm in handling
contrastingly different images.

Introduction 

The basic premise behind image segmentation is that an image exhibits certain numerical
characteristics allowing it to be subdivided into separate meaningful regions. The ability
to accurately identify these regions would appear to be rather simple, but this has not
proven to be the case. Although a variety of approaches have been explored, each with its
own relative merits, efforts in this area still continue since segmentation is a crucial
area in the field of image processing. Only if proper segmentation is achieved can each
region be classified using statistical or syntactic methods based on the region shape,
location, or internal characteristics.

Two  basic  approaches  to  image  segmentation  are  commonly  used:  edge  detection  and  region 
based.     Edge  detection  seeks  to  extract  edges  according  to  the  local  properties  of  an  image; 
Sobel  and  gradient  operators  are  usually  employed;  however,  these  are  often  sensitive  to 
local  noise.     The  edges  produced  are  usually  broken  and  are  difficult  to  use  for  object 
location. Region based methods exploit the global properties of the images for segmentation.
Images may be thresholded using gray level histogram analysis, which is less noisy.
A  shortcoming  of  the  global  method  is  that  false  regions  are  often  obtained  along  with  the 
desired  regions  due  to  a  basic  imperfection  of  the  thresholding  method  in  which  everything 
above  threshold  is  segmented. 

The  magnitude  contrast  segmenter  is  unique  in  that  it  falls  between  these  two  categories. 
The  algorithm  is  outlined  in  the  following  section  with  supporting  data  supplied  for  each 
step.     The  merits  of  the  magnitude  contrast  segmenter  are  discussed  and  supporting  results 
are provided in the last section. The difference image, contrast image, degapped image,
and the overlaid edge image are provided for each test image processed. These images constitute
a visual derivation of the edge components.

Segmentation 

The magnitude contrast segmenter consists of three steps: magnitude difference, contrast
evaluation, and edge degapping. The image is first processed by the 3 by 3 operator shown
below. For each pixel in the original image, OIMAGE, the magnitude of the largest pixel
difference is calculated and placed into a difference image, DIMAGE. For each application
of the magnitude operator, the eight neighboring pixels Nk are subtracted from the value of the
center pixel, Pij, and the largest absolute difference is obtained. The magnitude difference
process is outlined by the following equation:


    N1   N2   N3
    N4   Pij  N5
    N6   N7   N8

$$\mathrm{DIMAGE}(I,J) = \max_{k=1,\dots,8} \lvert P_{ij} - N_k \rvert$$
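
The magnitude difference step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `magnitude_difference` is hypothetical, the image is a plain list of rows, and neighbors that fall outside the image are simply skipped (the paper does not say how borders are treated).

```python
def magnitude_difference(oimage):
    """Return DIMAGE: the largest |Pij - Nk| over each pixel's eight neighbours."""
    rows, cols = len(oimage), len(oimage[0])
    dimage = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            p = oimage[i][j]
            best = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    # Assumption: neighbours outside the image are ignored.
                    if (di or dj) and 0 <= ni < rows and 0 <= nj < cols:
                        best = max(best, abs(p - oimage[ni][nj]))
            dimage[i][j] = best
    return dimage


# A vertical step edge between grey levels 10 and 50.
step = [[10, 10, 50, 50] for _ in range(3)]
print(magnitude_difference(step))  # each row is [0, 40, 40, 0]
```

Note that the step edge responds on both sides of the transition, which is consistent with the two-pixel edge links discussed later in the paper.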
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Figure  1  is  a  small  test  image  containing  a  diagonal  edge.     Note  that  the  diagonal  is  not 
perfect in that it is not uniformly stepped. Figure 2 is the result of applying the magnitude
difference operator to the test image. An intensity difference between the edge region
and  the  homogeneous  regions  it  separates  is  apparent. 
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Figure  1.     Test  Image 


Figure  2.     Difference  Image 


The  next  step  is  to  perform  a  contrast  evaluation  of  the  magnitude  difference  image.  The 
eight  immediate  Dk  neighbors  of  a  centralized  pixel,   Pij ,   are  averaged  to  form  a  border 
contrast  measurement.     This  average  is  reduced  by  the  value  of  the  center  pixel  and  placed 
into  the  edge  image,   EIMAGE,   as  shown  below. 


    D1   D2   D3
    D4   Pij  D5
    D6   D7   D8

$$\mathrm{EIMAGE}(I,J) = \frac{1}{8}\sum_{k=1}^{8} D_k - \mathrm{DIMAGE}(I,J)$$


At  this  stage  three  pixel  types  exist: 

1.  Positive  pixel  values 

2.  Negative pixel values

3.  Zero  pixel  values. 

Positive  pixels  are  those  which  do  not  contribute  to  the  determination  of  an  edge.     They  do, 
however,   contain  useful  region  information  concerning  the  intensity  distribution  of  both 
local  and  global  areas.     A  small  pixel  value,   typically  a  value  of  five  or  less,  indicates 
the  presence  of  a  homogeneous  region.     A  large  pixel  value  indicates  a  high  variance  within 
a  region.     A  pixel  which  falls  between  these  two  extremes  can  be  interpreted  as  being  in  a 
state  of  transition. 


Negative  pixels,  with  values  of  less  than  -5,   constitute  the  true  edge  elements  in  the 
image  and  are  set  to  255  to  indicate  such.     Those  pixels  greater  than  or  equal  to  -5  are 
converted  to  their  absolute  magnitudes  and  become  an  indication  of  region  homogeneity. 
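
A minimal sketch of the contrast evaluation and the classification rule just described follows. It reads the text as: average the eight DIMAGE neighbors, subtract the center value, mark results below -5 as edge (255), and keep the absolute value otherwise. The name `contrast_evaluate`, the `threshold` parameter, and the border handling (out-of-range neighbors replicate the nearest border pixel) are assumptions made for illustration.

```python
def contrast_evaluate(dimage, threshold=5):
    """Return EIMAGE: 255 marks an edge pixel, other values measure homogeneity."""
    rows, cols = len(dimage), len(dimage[0])
    eimage = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di or dj:
                        # Assumption: clamp indices so border pixels are replicated.
                        ni = min(max(i + di, 0), rows - 1)
                        nj = min(max(j + dj, 0), cols - 1)
                        total += dimage[ni][nj]
            value = total / 8 - dimage[i][j]   # neighbour average minus centre
            eimage[i][j] = 255 if value < -threshold else abs(value)
    return eimage


# Difference image of a vertical step edge (two-pixel-wide response).
dimage = [[0, 40, 40, 0] for _ in range(3)]
print(contrast_evaluate(dimage))  # each row is [15.0, 255, 255, 15.0]
```

In this small example the two columns straddling the step become edge pixels, while the flanking columns keep a positive value that reflects their proximity to the transition.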


Figure 3 shows the result of applying the contrast evaluation box to the difference image
shown in Figure 2. The edge is clearly defined by the pixel values of 255, but it also contains
a few values of zero. Pixel values of zero are candidates for edge degapping and fall
into one of two categories: they are either elements of a homogeneous [...]
created by the digital properties of the imagery. The category can be determined by once
again applying a 3 by 3 operator. In this application, the number of edge neighbors (Dk
values equal to 255) are summed. If the sum exceeds 3 (i.e. Pij is [...]
the pixel EDGE (I,J) is set to 255. If Pij has three or fewer edge neighbors (i.e. Pij is a
homogeneous region), it retains its zero pixel value. This edge degapping process need only
be applied to those edge values equal to zero. The edge degapping neighborhood operator is
composed of the following steps:


    E1   E2   E3
    E4   Pij  E5
    E6   E7   E8

    N = 0
    DO K = 1, 8
       IF (Ek = 255) THEN N = N + 1
    FIN
    IF (N .GT. 3) THEN EDGE (I, J) = 255


Figure 4 is the result of applying the degapping routine to the contrast image shown in Figure 3.
Pixels corresponding to homogeneous areas are unchanged while those pixels surrounded
by an edge are absorbed.
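
The degapping rule can be tried directly on the pixel values printed in Figure 3. The sketch below is an illustration, not the author's implementation: the function name `degap` is hypothetical, and out-of-range neighbors are assumed to replicate the nearest border pixel, since the paper does not describe border handling. Under that assumption the three zero-valued pixels that change are the same three that differ between Figures 3 and 4 as printed.

```python
FIGURE_3 = [
    [3, 3, 3, 0, 11, 35, 255, 0, 255, 40],
    [2, 2, 1, 8, 22, 255, 255, 255, 255, 23],
    [1, 0, 2, 9, 255, 255, 255, 255, 22, 12],
    [2, 3, 9, 19, 255, 255, 255, 23, 6, 0],
    [2, 11, 22, 255, 0, 255, 26, 7, 0, 0],
    [10, 21, 255, 255, 255, 255, 12, 0, 1, 0],
    [36, 255, 255, 255, 255, 20, 6, 0, 1, 1],
    [255, 255, 255, 255, 21, 5, 1, 0, 1, 0],
    [255, 255, 255, 20, 7, 0, 1, 0, 0, 0],
    [0, 255, 34, 10, 0, 1, 0, 0, 0, 0],
]


def degap(eimage):
    """Set zero-valued pixels with more than three edge neighbours to 255."""
    rows, cols = len(eimage), len(eimage[0])
    out = [row[:] for row in eimage]
    for i in range(rows):
        for j in range(cols):
            if eimage[i][j] != 0:
                continue  # only zero-valued pixels are degapping candidates
            n = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di or dj:
                        # Assumption: clamp indices so border pixels are replicated.
                        ni = min(max(i + di, 0), rows - 1)
                        nj = min(max(j + dj, 0), cols - 1)
                        n += eimage[ni][nj] == 255
            if n > 3:
                out[i][j] = 255
    return out


degapped = degap(FIGURE_3)
filled = [(i, j) for i in range(10) for j in range(10)
          if degapped[i][j] != FIGURE_3[i][j]]
print(filled)  # [(0, 7), (4, 4), (9, 0)]
```

All other zero-valued pixels have no edge neighbors in this example and therefore keep their value, as the text states for homogeneous areas.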


    3    3    3    0   11   35  255    0  255   40
    2    2    1    8   22  255  255  255  255   23
    1    0    2    9  255  255  255  255   22   12
    2    3    9   19  255  255  255   23    6    0
    2   11   22  255    0  255   26    7    0    0
   10   21  255  255  255  255   12    0    1    0
   36  255  255  255  255   20    6    0    1    1
  255  255  255  255   21    5    1    0    1    0
  255  255  255   20    7    0    1    0    0    0
    0  255   34   10    0    1    0    0    0    0

Figure 3.  Contrast Evaluated Image


    3    3    3    0   11   35  255  255  255   40
    2    2    1    8   22  255  255  255  255   23
    1    0    2    9  255  255  255  255   22   12
    2    3    9   19  255  255  255   23    6    0
    2   11   22  255  255  255   26    7    0    0
   10   21  255  255  255  255   12    0    1    0
   36  255  255  255  255   20    6    0    1    1
  255  255  255  255   21    5    1    0    1    0
  255  255  255   20    7    0    1    0    0    0
  255  255   34   10    0    1    0    0    0    0

Figure 4.  Degapped Image


In  actual  application,   the  true  edge  falls  between  pixels  rather  than  upon  them.  The 
magnitude  contrast  segmenter  identifies  a  series  of  two-pixel  edge  links  which  contain  the 
edge.     These  links  are  perpendicular  to  and  bisect  the  edge.     Figure  5  is  an  example  of 
these edge links using the original test image. The true edge, as determined by the segmenter,
is overlaid onto the original image and shown in Figure 6.

Testing  of  this  segmenter  has  yielded  some  interesting  results.     Due  to  its  design,  the 
edge  is  always  identified  by  a  series  of  links.     The  edge  can  be  thought  of  as  being  two 
pixels  thick,  but  in  reality  its  location  is  only  implied.     This  series  of  links  furnishes 
a  closed  boundary  around  a  region  without  any  further  iteration.     This  is  an  important 
feature,   allowing  the  algorithm  to  be  implemented  in  realtime.     A  realtime  capability  is 
sometimes not possible with other segmenters due to the time expended in closing the edge
gaps.

A useful feature for many types of military applications is the ability to segment objects
which consist of both high and low intensity areas as compared to the background
(e.g. a tank). In this regard, the magnitude contrast segmenter not only identifies the
absolute areas of the object, but also generates an internal structure which locates the
line of separation between the internal areas of contrast. This type of information is useful
if any kind of syntactic processing is to be performed on the segmented image. As previously
mentioned, regional information in terms of homogeneous measurements is also provided
and can be used for textural work.
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Figure  5.     The  Edge  Links 
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Figure  6.     The  Edge  Overlayed  on  the 
Original  Image 


Results 

A set of visual results is provided to demonstrate the segmentation ability of the magnitude
contrast segmenter. Four synthetic test images were created for this purpose; these
are shown in Figures 7A, 8A, 9A, and 10A. Table 1 identifies the mean and standard deviation
of the regions in each test image. For each image processed, the magnitude difference
image (Figures 7B, 8B, 9B, and 10B), edge contrast image (Figures 7C, 8C, 9C, and 10C), and
the original image with the edge image overlaid onto it (Figures 7D, 8D, 9D, and 10D) are
presented. By viewing these intermediate results, the evolution of the edge components can
be observed.
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Figure  8A.     Synthetic  Test  Image 


Figure  8B .     Magnitude  Difference  Image 


SPIE Vol. 341 Real Time Signal Processing V (1982) / 115


Figure 9C.  Edge Contrast Image

Figure 9D.  Overlayed Edge Image
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Figure  10A. 


Synthetic  Test  Image 


Figure  10B.     Magnitude  Difference  Image 
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Partitioning  and  tearing  applied  to  cellular  array  processing 
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Abstract 

Cellular arrays are regular structures of computing elements with fixed and simple
modes of communication and control. They exhibit both parallel computation and pipelined
data flow to achieve high performance for the execution of regular algebraic operations,
such as matrix multiplication and the solution of simultaneous linear equations.

This paper is concerned with the use of partitioning and tearing algorithms to deal
with problems which are not matched to the array size or have certain irregularities in
structure. Lack of regularity may arise from a sparse model formulation or from
irregularity in data flow, caused by pivoting failure during elimination.

We  provide  specific  algorithms  for  stable  solution  of  partitioned  linear  equations, 
without  conventional  pivoting,  and  briefly  discuss  their  use  in  efficiently  handling 
sparse  equation  models. 

Introduction 

In earlier papers [1]-[3], we discussed specific cellular array architectures and
algorithms for carrying out matrix addition and multiplication, as well as the solution
of linear equations by LU decomposition and ordered substitution. This work is similar in
spirit to, but differs in the details of its execution from, that of H. T. Kung et al.
[4] and S. Y. Kung [5]. A colleague, and others [6], have developed similar means for
solving linear equations by orthogonal decompositions, using Givens rotations. Also
related to this work is the development of means for the singular value decomposition of
data arrays, using a Jacobi iteration [7]. All of these algorithms are regular in that
they are implemented by an array of processors, each of which performs identical
operations on data distributed in space. One such array structure is shown in Figure
(1). Regularity of data flow is illustrated, in Figures (2) and (3), for LU
decomposition. See [1]-[3] for details.

Each of these algorithms can become irregular due to the occurrence of any of three
conditions: (1) the problem size is larger than the array size; (2) pivoting failure
occurs  during  an  elimination  operation;  or  (3)  the  problem  is  so  large  and  sparse  that 
efficient  solution  requires  the  exploitation  of  the  problem  structure.  We  discuss  methods 
of  circumventing  these  difficulties  based  on  partitioning  and  tearing  algorithms. 

Tearing Linear Algebraic Equations

Tearing  methods  consist  of  finding  ways  of  expressing  problems  as  a  collection  of 
simpler  ones  which,  when  solved  individually  and  recombined  by  some  means,  provide  the 
solution  to  the  original  problem.  We  will  show  how  to  do  this  for  the  solution  of  linear 
algebraic  equations  for  each  of  the  circumstances  causing  irregularity,  cited  above. 

Suppose  that  the  problem  we  wish  to  solve  is  expressed  by  the  simultaneous  linear 
algebraic  equations: 

A  x  =  b 


The  coefficient  matrix  can  be  decomposed,   to  advantage,   in  the  form: 


A = A(0) + L(1) R(1) + L(2) R(2) + .. + L(m) R(m)



A(j) = A(j-1) + L(j) R(j),    j = 1,2,..,m

Then, the linear equations may be expressed as:

A(j) x(j) = b,    j = 0,1,2,..,m
A(m) = A,    x(m) = x

Our  tearing  methods  will  find  ways  of  solving  for  each  of  the  x(j)'s  successively,  by 
means  which  are  easier  or  more  regular  than  the  solution  of  the  original  set. 

We     begin  by  denoting  the  matrix  inverse  of  A(j)   as  B(j).   If  the  inverses  of  each  of 
the  A(j)  exist: 

$$B(j) = A(j)^{-1} = [A(j-1) + L(j)\,R(j)]^{-1}$$

Using a well known matrix identity [8]:

$$\begin{aligned}
B(j) &= A(j-1)^{-1} - A(j-1)^{-1}\,L(j)\,[I + R(j)\,A(j-1)^{-1}\,L(j)]^{-1}\,R(j)\,A(j-1)^{-1} \\
     &= B(j-1) - B(j-1)\,L(j)\,[I + R(j)\,B(j-1)\,L(j)]^{-1}\,R(j)\,B(j-1)
\end{aligned}$$

Denoting  the  product  B(j)   L(k)    as  V(j,k): 

V(j,k)    =  B(j)  L(k) 

We  arrive  at  the  fundamental  algorithm,   from  which  most  of  our  results  follow: 
=========================FUNDAMENTAL  ALGORITHM=========================

Initialization:

$$A(0)\,V(0,k) = L(k) \;\Longrightarrow\; V(0,k), \qquad k = 1,2,\dots,m \tag{1}$$

Iteration:

$$V(j,k) = V(j-1,k) - V(j-1,j)\,[I + R(j)\,V(j-1,j)]^{-1}\,R(j)\,V(j-1,k) \tag{2}$$

$$j = 1,2,\dots,m; \qquad k = j+1, j+2, \dots, m$$

$$x(j) = x(j-1) - V(j-1,j)\,[I + R(j)\,V(j-1,j)]^{-1}\,R(j)\,x(j-1) \tag{3}$$

$$j = 1,2,\dots,m$$
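
To make the recursion concrete, here is a minimal Python sketch of the rank-one special case, in which every L(j) is a single column and every R(j) a single row, so that the bracketed term [I + R(j) V(j-1,j)] reduces to a scalar. Further assumptions, made only to keep the example free of any matrix library: A(0) is diagonal (so equation (1) is a division), x(0) is obtained from A(0) x(0) = b, exact rational arithmetic is used, and the names `tear_solve`, `a0_diag`, `ls`, and `rs` are hypothetical. In the block case described in the paper, the scalar division would be replaced by a small partition-sized solve.

```python
from fractions import Fraction


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def tear_solve(a0_diag, ls, rs, b):
    """Solve (diag(a0_diag) + sum_j ls[j] rs[j]^T) x = b by rank-one tearing."""
    n, m = len(b), len(ls)
    # Equation (1): V(0,k) = A(0)^{-1} L(k); A(0) is diagonal in this sketch.
    v = [[Fraction(l[i]) / a0_diag[i] for i in range(n)] for l in ls]
    x = [Fraction(b[i]) / a0_diag[i] for i in range(n)]  # x(0)
    for j in range(m):
        vj = v[j]
        pivot = 1 + dot(rs[j], vj)  # scalar form of [I + R(j) V(j-1,j)]
        if pivot == 0:
            raise ZeroDivisionError("A(j) is singular for this ordering")
        # Equation (2): update the remaining V(j,k), k > j.
        for k in range(j + 1, m):
            c = dot(rs[j], v[k]) / pivot
            v[k] = [vk_i - vj_i * c for vk_i, vj_i in zip(v[k], vj)]
        # Equation (3): update the solution estimate x(j).
        c = dot(rs[j], x) / pivot
        x = [x_i - vj_i * c for x_i, vj_i in zip(x, vj)]
    return x


# [[2, 1], [1, 3]] = diag(2, 3) + e1 [0 1] + e2 [1 0]
x = tear_solve([2, 3], ls=[[1, 0], [0, 1]], rs=[[0, 1], [1, 0]], b=[3, 5])
print(x)  # [Fraction(4, 5), Fraction(7, 5)]
```

The intermediate vector after the first pass, x(1) = (2/3, 5/3), solves the partially assembled system [A(0) + L(1) R(1)] x(1) = b, which mirrors the sequence of intermediate solutions used in the paper's example.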
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Partitioned  Linear  Equations 


This  algorithm  can  be  applied  to  solve  a  large  set  of  linear  equations  in 
partitioned  form.  That  is,  we  partition  the  coefficient  matrix  A  and  vectors  x  and  b  into 
sizes  which  fit  on  our  cellular  array. 

Note  that  the  only  inversions  required  are  in  the  initialization  of  the  vector 
sequence  V(0,k)   and  the  terms: 

[I   +  R(j)  V(j-l,j)] 

This  latter  term  is  the  same  size  as  the  partition,  which  would  be  no  larger  than  the 
computational  array.  The  initialization  of  V(0,k)  uses  only  the  initial  matrix  A(0).  If 
we  make  this  block  diagonal,  with  blocks  no  larger  than  the  array  then  we  can  solve  the 
whole problem in partitioned form. We illustrate this partitioning process with an
example, given below.
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The  vectors  V(j,k)    are  found  from  equations   (1)   and  (2): 


V(0,1)  = 


0 

0 

0 

0 

1 

0 

0 

1 

0 

0.5 

0.5 

0 

0 

0 . 5 

-0.5 

0 

0 

-0.5 

0 

-.25 

V(0,2) 


0.5  0 

0 

u       U  •  D 

n  ^ 

0  0 

V(0,3)  = 

0 

0  0 

0 

0  0 

0 

0 
0.5 

V(l,2)    =     -0.5     0  V(l,3)    =  0 

-0.5 
-.25 

and  the  x(j)'s  are  found  from  equation  (3): 


$$x(0) = [\,0.5,\ 0.5,\ 1,\ 1,\ 0.5\,]^T$$

$$x(1) = [\,0.5,\ 0.5,\ 0.5,\ 0.5,\ 0.25\,]^T$$

$$V(2,3) = [\,0,\ 1,\ 0,\ -1,\ -0.5\,]^T$$

$$x(2) = [\,0,\ 0,\ 1,\ 1,\ 0.5\,]^T$$

x(3) = [...]


These  are  seen,   by  direct  substitution,   to  be  the  solutions  of  the  equations: 

A(0) x(0) = b
[A(0) + L(1) R(1)] x(1) = b
[A(0) + L(1) R(1) + L(2) R(2)] x(2) = b
A x(3) = b


Note  that  every  solution  of  linear  equation  sets  in  the  example  may  be  implemented 
by  two  by  two  inversions.  This  occurs  because  we  made  A(0)  into  a  block  diagonal  form 
with  blocks  no  larger  than  two  by  two.  In  fact,  every  operation  may  be  performed  in  two 
by  two  partitions,   in  a  very  direct  manner. 


Proposition:     Partitioned  Solution  of  Linear  Algebraic  Equations 

Any non-singular square set of linear algebraic equations can be
solved in m by m partitions, with the Fundamental algorithm. Block or
row pivoting operations may be required. m(m+1)/2 applications of
equation (2) and m+1 applications of equation (3) are required.
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Decomposition  without  Pivoting 


In  most  sequential  processing  algorithms  it  is  customary  to  use  row  exchanges, 
during  the  course  of  decomposition,  as  needed  to  avoid  premature  termination  or  loss  of 
accuracy  due  to  division  by  small  pivots.  We  find  it  inconvenient  to  exchange  rows  during 
decomposition,  when  using  cellular  arrays.  However,  there  is  another  alternative,  based 
on  equation  augmentation,  and  the  tearing  ideas  discussed  above,  for  stable 
decomposition. 

When  a  pivot  failure  occurs,  we  make  a  new  equation,  with  a  new  variable,  set  equal 
to  the  variable  at  the  position  where  pivot  failure  occurs.  This  new  equation  is  added  to 
the  current  equation,  without  disturbing  the  pipelined  data  flow,  and  without  altering 
the  solution  of  the  equations  being  processed.  Consider  the  linear  system,  of  dimension 
n,  given  in  equation  (4).  Suppose  that  a  pivoting  failure  occurred  at  variable  j.  Then, 
we  form  the  new  equation: 


$$x(n+1) = x(j)$$

Adding this to the jth equation:

$$\begin{aligned}
a(1,1)\,x(1) + \cdots + a(1,j)\,x(j) + \cdots + a(1,n)\,x(n) + 0\,x(n+1) &= b(1) \\
a(2,1)\,x(1) + \cdots + a(2,j)\,x(j) + \cdots + a(2,n)\,x(n) + 0\,x(n+1) &= b(2) \\
&\;\;\vdots \\
a(j,1)\,x(1) + \cdots + [a(j,j) + 1]\,x(j) + \cdots + a(j,n)\,x(n) - 1\,x(n+1) &= b(j) \\
&\;\;\vdots \\
a(n,1)\,x(1) + \cdots + a(n,j)\,x(j) + \cdots + a(n,n)\,x(n) + 0\,x(n+1) &= b(n) \\
0\,x(1) + \cdots + 1\,x(j) + \cdots + 0\,x(n) - 1\,x(n+1) &= 0
\end{aligned}$$


The augmented equation set has the same solution, and a very special form, which can be
partitioned and solved as in the previous section. We illustrate this process with an
example.
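
The augmentation step itself is simple bookkeeping, as the following Python sketch suggests. The function name `augment` and the list-of-rows representation are assumptions for illustration, and the 2 by 2 system used here is a hypothetical one with a zero in the (1,1) position, not necessarily the system of the paper's example. The sketch only builds the augmented system and checks that the original solution, extended by x(n+1) = x(j), satisfies it.

```python
def augment(a, b, j):
    """Append the equation x(n+1) = x(j) and add 1 to the pivot a[j][j]."""
    n = len(a)
    rows = [list(row) + [0] for row in a]   # new column of zeros for x(n+1)
    rows[j][j] += 1                         # [a(j,j) + 1] x(j)
    rows[j][n] = -1                         # ... - 1 x(n+1) in row j
    new_row = [0] * (n + 1)
    new_row[j], new_row[n] = 1, -1          # 1 x(j) - 1 x(n+1) = 0
    rows.append(new_row)
    return rows, list(b) + [0]


# Hypothetical system with a pivot failure in the (1,1) position; its solution is (1, 1).
a_aug, b_aug = augment([[0, 1], [1, 1]], [1, 2], 0)
print(a_aug)  # [[1, 1, -1], [1, 1, 0], [1, 0, -1]]

x_aug = [1, 1, 1]  # original solution followed by x(3) = x(1)
residual = [sum(c * v for c, v in zip(row, x_aug)) - rhs
            for row, rhs in zip(a_aug, b_aug)]
print(residual)  # [0, 0, 0]
```

In this particular case the leading 2 by 2 partition of the augmented matrix is itself singular, so a second augmentation would be needed before elimination could proceed, which is the situation the paper's own example goes on to describe.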

The  equation  set  below  is  non-singular,  but  LU  decomposition  cannot  begin  due  to  the 
pivoting  failure   in  the   (1,1)  position. 


We  augment  to  obtain  the  new  system: 
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The  augmentation  algorithm  inadvertantly  made  the  upper  left  two  by  two  partition 
singular,  although  the  whole  system  is  not  singular.  This  is  detected  by  the  occurrence 
of  another  pivot  failure.  We  augment  the  equation  set  again,   to  obtain: 
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We  solve  the  partitioned  set,   as  before,  with  the  results: 


V(0,1)  = 


x(0)  = 


0 

0 

-2 

1 

0 

0 

1 

-1 

V(0,2)  = 

1 

0 

0 

0 

0 

-1 

0 

0 

1 

1 

0 

0 

x(l)  = 

0 

1 

0 

0 

V(l,2)  = 


x(2)  = 


-2  1 

1  -1 

-2  1 

1  -1 


Proposition:     Stable  Decomposition  Without  Row  Exchanges 

LU  decomposition  can  be  executed  with  numerical  stability,  usually 
obtained  through  row  exchanges,  by  augmentation  and  partitioned 
solution.  No  row  exchanges  are  required.  Upon  pivot  failure,  unity
(or  some  other  convenient  magnitude)  is  added  to  the  pivoting
coefficient.  One  additional  row  and  column  are  (implicitly)  added  to
the  equation  set.  The  decomposition  proceeds  until  the  original  set 
has  been  decomposed.  Then  the  new  partitions  are  accounted  for  as  in 
the  example,  above. 
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Sparse  Equations 


We have shown earlier [1]-[3] that cellular arrays can be an effective tool for the
solution  of  sparse  equations,  provided  that  we  can  reorder  non-zero  coefficient  data  into 
dense  partitions.  Several  efficient  methods  for  such  reorderings  are  known  [1].  The 
result  is  usually  a  bordered  block  diagonal  form,  or  banded  form. 

The  bordered  block  diagonal  form  can  be  partitioned,  as  shown  below.  Then,  a 
partitioned  solution,  requiring  inversions  no  larger  than  the  largest  block  can  be 
computed  by  the  same  means  as  discussed  above. 


A(ll) 

A(22) 

B 

xxxx  . 

C 

A  (n,  n 

A(ll) 


A(22) 


A (n, n) 


0       0     ...     0  I 


I 


Note  that  no  elimination  of  the  zero  valued  terms,  outside  of  the  dense  diagonal 
blocks  or  borders,  is  required.  This  fact  is  responsible  for  dramatic  improvements  in 
computational efficiency for truly sparse systems of equations.

Summary 

We  have  presented  tearing  and  partitioning  algorithms,  which  are  intended  to  match 
certain  irregular  problems  with  regular  parallel  and  pipelined  computational  procedures, 
implemented  on  cellular  arrays.  The  fundamental  algorithm  provides  means  for  partitioning 
large  problems  to  fit  on  smaller  arrays,  for  recovering  from  pivoting  failures  during 
elimination,  without  recourse  to  row  exchanges,  and  for  efficiently  solving  sparse 
equations  which  have  been  rearranged  to  bordered  block  or  banded  forms.  A  rank  one 
version  has  been  described,  elsewhere  [1],  for  use  with  a  fast  banding  procedure,  which 
forms  dense  bands  with  a  few  outlying  elements. 
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Figure  1.     Rectangular  Cellular  Array  Processor 
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Figure  2.     LU  Decomposition,  Pass  #1 


Figure  3.     LU  Decomposition,  Pass  #2
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Abstract 

A  set  of  logical  and  algebraic  image  operations  is  introduced.     These  image  operators 
include  operators  for  binary,   silhouette,   and  grey-scale  imagery.     Simple  illustrations  of 
these  image  operators  are  given.     Examples  of  industrial  inspection  and  robotic  visual 
feedback  problems  are  considered.     Results  of  object  identification  and  defect  locations 
obtained  using  these  image  operations  are  presented. 

Introduction 

American  industry  has  recognized  the  need  for  significant  improvements   in  the 
manufacturing  technology  base.     One  only  has  to  look  at  the  growth  in  exhibits  and 
overflowing  attendance  at  the  recent  Autofact  and  Robots  VI  shows.     Manufacturers  are  using 
or  planning  new  automated  inspection,   non-destructive  testing,   and  automated  assembly  for 
the  productivity  and  quality  assurance  improvements  needed  today. 

A  key  element  of  many  of  these  improvements  will  be  a  machine  vision  system  which 
properly  marries  a  sensor  to  acquire  the  "image"  and  a  processor  to  perform  real  time 
signal  processing.     Such  systems  must  be  capable  of  automatic  recognition  of  parts  and 
detail  for  inspection  and  assembly.     They  must  operate  rapidly  to  meet  and  improve  on 
current  production  line  rates.     The  systems  must  be  adaptable  to  changes  in  part  design, 
inspection  criteria  and  task  assignment.     Manufacturing  environments  dictate  that  the 
system  overcome  the  variability  in  object  positions  and  orientation,    illumination  and 
noise . 

A number of image processing systems are on the market and in military programs (1)
which will perform analysis on data from framing sensors such as television. Most perform
their analysis on an image which has been thresholded to create a binary or silhouette
image of the objects of interest. Statistical "features" of the objects are derived and
further analysis proceeds in this feature space. Conventional general purpose (Von
Neumann) digital computers are generally used with some dedicated front end high speed
hardware.

The  Environmental  Research  Institute  of  Michigan   (ERIM)  has  taken  a  different  approach 
to  image  processing  operations  and  processor  architecture.     Our  processing  system  uses 
operations  on  the  image  as  a  whole,   remaining  in  image  space  rather  than  feature  space. 
The  cellular  nature  of  these  operations  allows  us  to  implement  a  computer  architecture 
which employs a high degree of parallelism. The Cytocomputer™ is a series of identical
programmable processing stages optimized for real time image processing. For development
purposes the Cytocomputer™ operates in our laboratory with display terminals and
interactive software which permit rapid algorithm development in a manner natural to the
analyst.

This paper will describe the formalism of the neighborhood processing algebra and
fundamentally useful operations using cellular processing. The Cytocomputer™ architecture
will be discussed. Examples of industrial technology problems which demonstrate the use of
the silhouette and grey-scale processing power of the Cytocomputer™ will be given.

Image  Algebra 

Image Algebra is a formal mathematical method which is used to describe the
transformation of an input image of a cellular digital image processor (the Cytocomputer™).
Programming the Cytocomputer™ involves selecting the series of transitions of the picture
elements (pixels) of the image based only upon the state of the pixel and the states of its
eight nearest neighbors. In many cases a transition rule can be expressed as an image
itself, which is then referred to as a "reference image" or "probe".

A two-dimensional binary image is a set of points representing the "black" or foreground
points of a typical black and white image and having a reference origin. Then given two
such images A and B the usual set-theoretic operators are defined: Intersection A ∩ B, Union
A ∪ B, Difference, and Complement. Addition and scaling can be applied to the points.
Translation of an image (set) by a point moves the origin of the image to that point.
Using translation, the new operators of Dilation and Erosion are defined as follows,
where a subscript denotes translation of an image by a point:

$$\text{Dilation:}\quad A \oplus B = \bigcup_{b \in B} A_b \qquad\qquad \text{Erosion:}\quad A \ominus B = \{\, p \mid B_p \subseteq A \,\}$$


That   is  the  Dilation  of  A  by  B  is  the  union  of  translations  of  A  by  points  from  B.  The 
Erosion  of  A  by  B  is  the  set  of  points  to  which  B  may  be  translated  and  still  be  contained 
in  A.     Note  that  A  and  B  may  be  arbitrarily  complex. 
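
These two definitions translate almost word for word into operations on Python sets of (x, y) points. The sketch below is only illustrative: the function names `translate`, `dilate`, and `erode` are hypothetical, and images are assumed to be finite, non-empty sets of integer coordinates.

```python
def translate(image, p):
    """Translate every point of the image by the point p."""
    return {(x + p[0], y + p[1]) for (x, y) in image}


def dilate(a, b):
    """Union of the translations of A by each point of B."""
    out = set()
    for p in b:
        out |= translate(a, p)
    return out


def erode(a, b):
    """Points p such that B translated by p is still contained in A."""
    bx, by = next(iter(b))  # any point of B; B is assumed non-empty
    # If B translated by p fits in A, then p + (bx, by) lies in A,
    # so only these candidates need to be tested.
    candidates = {(x - bx, y - by) for (x, y) in a}
    return {p for p in candidates if translate(b, p) <= a}


square = {(x, y) for x in range(3) for y in range(3)}  # 3 by 3 block
pair = {(0, 0), (1, 0)}                                # two horizontal points

print(len(dilate(square, pair)))     # 12: the block grows to 4 by 3
print(sorted(erode(square, pair)))   # the 2 by 3 block x in {0, 1}, y in {0, 1, 2}
```

Dilation by the two-point image widens the square by one column, while erosion keeps only those positions where both points of the pair still fall inside the square.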

Large  complex  images  may  be  composed  and  decomposed  by  a  sequence  of  these  elementary 
operations.     The  image  algebra  includes   identities  and  proofs  which  permit  simplifications. 
For example, to compute the Erosion of A by B where B may be formed by the dilation of
elementary 3x3 images:

$$B = B_1 \oplus B_2 \oplus B_3 \oplus B_4$$

it is sufficient to actually compute

$$A \ominus B = ((((A \ominus B_1) \ominus B_2) \ominus B_3) \ominus B_4)$$
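
The decomposition can be checked numerically on a small case. The following Python sketch uses a set-of-points representation with hypothetical helper names (`translate`, `dilate`, `erode`) and two elementary images instead of four; it illustrates the identity on one example and is not a proof.

```python
def translate(image, p):
    return {(x + p[0], y + p[1]) for (x, y) in image}


def dilate(a, b):
    return {(x + bx, y + by) for (x, y) in a for (bx, by) in b}


def erode(a, b):
    bx, by = next(iter(b))  # B is assumed non-empty
    candidates = {(x - bx, y - by) for (x, y) in a}
    return {p for p in candidates if translate(b, p) <= a}


a = {(x, y) for x in range(4) for y in range(4)}  # 4 by 4 block
b1 = {(0, 0), (1, 0)}                             # horizontal pair
b2 = {(0, 0), (0, 1)}                             # vertical pair
b = dilate(b1, b2)                                # 2 by 2 block

direct = erode(a, b)                 # erosion by the composite image
staged = erode(erode(a, b1), b2)     # erosion by the elementary images in turn
print(direct == staged)              # True
print(len(direct))                   # 9: the 3 by 3 block of valid positions
```

Each stage of the sequential form only needs a small elementary probe, which is what makes the identity attractive for hardware limited to 3x3 neighborhoods.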

The  principles  of  Dilation  and  Erosion  may  also  be  extended  to  two-dimensional  images 
which  have  a  grey  value  associated  with  each  pixel.     This  grey  value  may  represent 
photographic  tone,   x-ray  transmission,   radiographic  intensity,   thermal  emission,   or  range 
to  surface.     In  this  sense  the  images  are  regarded  as  three-dimensional  volumes  with  0 
above  and  1   below  the  image  surface. 

Cytocomputer™ Architecture

The  most  obvious  implementation  of  a  cellular  processor  is  the  two-dimensional  parallel 
array  of  processors  each  having  communication  to  its  eight  nearest  neighbors.     (Fig  1) 
Processing  of  the  image  stored  in  the  array  consists  of  programming  all  the  individual 
processors  with  a  transition  instruction  which  produces  a  new  pixel  based  on  the  state  of 
the  original  pixel  and  the  state  of  all  the  nearest  neighbors.     Thus  each  instruction 
simultaneously  affects  all  pixels  in  parallel  to  produce  the  new  image.     Each  transition 
may  correspond  to  a  single  primitive  image  algebra  operation.     By  a  series  of  such 
operations,   operations  between  complex  images  are  achieved. 

Naturally  the  processor  must  be  reprogrammed  after  each  transition.     To  truly  process 
the  entire  image  in  parallel  the  array  must  contain  a  processor  for  each  pixel  in  the 
image.     In  the  case  of  the  high  resolution  sensors  expected  to  be  applied  in  industrial 
applications  as  many  as  1024  by  1024  pixels  will  be  generated  necessitating  a  million 
processor  array. 

The  practical  shortcomings  of  parallel  array  image  processors  and  the  need  to  process 
grey-scale  images  from  a  ranging  sensor  led  ERIM  to  develop  an  alternative  parallel 
structure, the Cytocomputer™. Lougheed and McCubbrey have compared the Cytocomputer™ and
array cellular processors to show where each is of advantage.(2) The Cytocomputer™
consists  of  a  serial  pipeline  of  neighborhood  processing  stages  operating  from  a  single 
clock  with  each  stage  performing  a  single  transition  on  the  pixels  of  the  image  as  the 
image   is  pumped  through  the  pipeline.     The  images  pass   through  the  pipeline  as  a  stream  of 
eight  bit  pixels   in  sequential  line  scanned  format.     Following  the  initial  delay  to  fill 
the  pipeline  the  processed  images  exit  at   the  same  rate  they  are  entered.     A  line  scanned 
image  may  be  continuously  processed  as  an  infinite  height  image  and  framed  raster  scan 
images  may  be  processed  sequentially. 

Shift  registers  within  each  stage  store  two  contiguous  scan  lines  and  window  registers 
hold  the  nine  neighborhood  pixels  for  use  by  the  neighborhood  logic  module.  The 
neighborhood  logic  module  performs  the  preprogrammed  transformation  of  the  center  pixel 
based  on  the  values  of  the  nine  pixels.     Biasing,   contribution,   participation,   and  masking 
of  individual  pixels  is  under  program  control  allowing  the  total  possible  transformations 
to  be  very  large.     The  new  center  pixel  value  is  passed  on  to  the  succeeding  stage  while 
the   input  data   is   retained  and  shifted  to  center  the  window  on  the  next  pixel  of  the  image. 
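
A single stage of this kind can be modelled in software as a generator that consumes a raster-scanned pixel stream and keeps only two scan lines plus three pixels of history. The sketch below is a simplified illustration, not the hardware design: the name `stage` is invented here, and the choice to emit outputs only for interior window positions (so each stage trims a one-pixel border) is an assumption, since the text does not say how image edges are treated.

```python
from collections import deque


def stage(pixels, width, transform):
    """Yield transform(3x3 window) for each interior pixel of a raster stream."""
    # Two full scan lines plus three pixels are enough to hold a 3x3 window.
    buffer = deque(maxlen=2 * width + 3)
    for index, pixel in enumerate(pixels):
        buffer.append(pixel)
        row, col = divmod(index, width)
        if row >= 2 and col >= 2:
            # buffer[0] is pixel (row-2, col-2); the window is centered on (row-1, col-1).
            window = [[buffer[r * width + c] for c in range(3)] for r in range(3)]
            yield transform(window)


def window_min(window):
    """Flat grey-scale erosion over the full 3x3 neighborhood."""
    return min(min(line) for line in window)


def window_max(window):
    """Flat grey-scale dilation over the full 3x3 neighborhood."""
    return max(max(line) for line in window)


image = list(range(25))                          # 5x5 image in line-scanned order
eroded = list(stage(image, 5, window_min))       # 3x3 result after one stage
# Two stages chained as a pipeline; the second sees a 3-pixel-wide stream.
chained = list(stage(stage(image, 5, window_min), 3, window_max))
```

Because each stage is a generator, chaining them mirrors the pipeline: a pixel leaves the second stage as soon as enough input has streamed through the first.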

The  transformation  process  may  be  visualized  as  a  3x3  window  moving  across  the  image  as 
shown  in  Figure  2.     The  window  register  shows  the  contents  as  the  scan  reads  in  pixel 
A(6,6).     The  window  contains  all  the  pixel  information  for  the  transformation  of  pixel 
A(5,5). Figure 3 shows the organization of the pipeline and a stage. The information in
the window register is sampled by the neighborhood logic module and transformed according
to program. Each stage may be likewise regarded as a window passing over the image
produced by its preceding stage.

The Cytocomputer™ architecture lends itself well to VLSI implementation. The limited
number  of  I/O  connections  required  between  stages  fits  well  within  the  pin  count 
constraints  of  single  packages.     As  IC  technology  advances,  more  stages  may  be  combined  in 
a  single  package.     The  large  number  of  identical  parts  per  system  will  provide  the  volume 
necessary  for  economical  manufacture. 

ERIM has designed and fabricated two versions of the Cytocomputer™ for laboratory
research  use  over  the  past  few  years.     Figure  4  shows  the  Cytocomputer™  II   (Cyto  II)  which 
contains  the  pipeline,   a  controller  and  a  high  speed  semiconductor  memory  for  image 
recirculation.     The  processing  unit  occupies  the  central  card  cage  and  the  buffer  memory  is 
directly  above.     The  large  cabinet  can  contain  additional  cages  for  pipeline  expansion. 

Operations  using  cellular  processing 

Cellular  operations  may  be  applied  in  many  useful  ways  to  processing  images.  The 
operations  are  not  restricted  to  the  immediate  3x3  neighborhood  but  concatenate  out  to 
larger  neighborhoods  according  to  the  number  of  stages.     The  grey-scale  processing 
capability of the Cytocomputer™ has proved very useful in devising nonlinear image
filters. For example, in our range imagery we must contend with speckle noise which
manifests itself as pixel values grossly different from their neighbors. Sometimes they
appear  in  small  patterns.     A  series  of  3x3  filters  which  examine  the  data  from  all  or  a 
selected  set  of  the  neighbors  can  detect  these  excursions  and  replace  them  with  more 
reasonable  values. 

Another  type  of  filtering  which  is  analogous  to  low  pass  filtering  is  achieved  by 
erosion  and  dilation  with  grey-scale  reference  images.     Again  the  reference  images  are 
produced  by  a  series  of  operations  by  3x3  windows.     By  control  of  the  neighborhood 
specification  and  pixel  incrementing  the  reference  images  may  assume  a  grey-level  shape. 
Figure  5  is  a  representation  of  a  large  hemisphere  created  in  the  Cytocomputer™.  Filtering 
by  the  use  of  shaped  reference  images  results  in  an  image  which  is  the  surface  constructed 
of  the  combination  of  the  reference  images  which  fits  under  the  original  image.     Such  an 
image  is  smoothed  to  a  degree  dependent  on  the  shape  and  size  of  the  reference  image. 

This  principle  of  filtering  proves  very  useful  in  processing  grey-scale  images  to 
separate  the  foreground  smaller  spatial  objects  from  the  background  which  may  contain 
slowly  varying  levels  and  noise.     Figure  6  is  an  image  of  a  piece  of  text  which  is  unevenly 
illuminated  in  two  directions  and  contains  random  noise.     A  single  level  slice  will  not 
adequately  segment  the  letters.     If  the  image  is  passed  through  a  rolling  ball  filter  where 
the  diameter  of  the  hemisphere  is  selected  somewhat  larger  than  the  letter  element  width, 
the  returned  image  looks  like  Figure  7.     This   is  the  background  of  the  image  and  one  can 
see  the  uneven  background  shading  which  was  apparent  in  the  original.     By  subtracting  this 
image  from  the  original  we  create  a  background  normalized  text  image   (Fig  8)  where  the 
background  is  flat  and  the  letters  all  have  the  same  contrast.     Further  filtering  may  be 
employed  to  clean  up  the  residual  noise  within  the  letters  and  a  single  level  slice  will 
now  produce  a  binary  image. 
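
The background-normalization steps can be illustrated on a single scan line. The sketch below is a one-dimensional simplification under stated assumptions: the foreground is taken to be brighter than the background, the reference image is a hemispherical profile of integer radius, and the function names and the synthetic `profile` are invented for the example. The background is estimated by a grey-scale erosion followed by a dilation with the same reference profile, which gives the surface that fits under the original.

```python
import math


def hemisphere(radius):
    """Heights of a hemispherical reference profile at offsets -radius..radius."""
    return {s: math.sqrt(radius * radius - s * s) for s in range(-radius, radius + 1)}


def erode(f, g):
    """Grey-scale erosion: lower the reference profile until it fits under f."""
    n = len(f)
    return [min(f[x + s] - h for s, h in g.items() if 0 <= x + s < n) for x in range(n)]


def dilate(f, g):
    """Grey-scale dilation: upper envelope of the reference profile placed on f."""
    n = len(f)
    return [max(f[x - s] + h for s, h in g.items() if 0 <= x - s < n) for x in range(n)]


def rolling_ball_background(f, radius):
    """Surface swept by the reference profile while it stays under f."""
    g = hemisphere(radius)
    return dilate(erode(f, g), g)


def normalize(f, radius):
    """Subtract the estimated background from the original profile."""
    return [v - b for v, b in zip(f, rolling_ball_background(f, radius))]


# Synthetic scan line: unevenly lit background (a ramp) with two narrow bright strokes.
strokes = {10, 11, 25, 26}
profile = [0.5 * x + (10.0 if x in strokes else 0.0) for x in range(40)]

flattened = normalize(profile, radius=4)       # ball diameter exceeds the stroke width
binary = [v > 5.0 for v in flattened]          # a single level slice now works
```

In this synthetic line no single threshold separates the strokes in the raw profile, because the background at the right end (19.5) is brighter than the first stroke (15.0); after subtraction a slice at 5.0 isolates both strokes.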

Application  to  industrial  technology 

The  Cytocomputer™  may  be  applied  to  a  number  of  machine  vision  applications  requiring 
sensor  controlled  manipulation.     Several  of  these  are  subjects  of  current  research  at  ERIM. 
In  general  whole  image  neighborhood  transform  processing  is  applicable  to  these 
applications : 

*  Manipulation  of  separated  or  non-separated  workpieces  on  conveyors 

*  Parts  Acquisition 

*  Quality  Assurance  Inspection 

*  Manipulation  in  Manufacturing  Processes 

*  Automated  Assembly 

The  manipulation  of  separated  workpieces  on  conveyors   is  an  application  treated  by  other 
machine  vision  systems.     One  advantage  of  the  Cytocomputer™  based  approach  is  in  the  use  of 
grey-level  processing  to  allow  accurate  acquisition  of  the  workpiece  image  under  varying 
lighting  and  background  conditions.     Another  advantage  is  that  the  shape  recognition 
approach  allows  touching  workpieces  to  be  analyzed  as  if  they  were  separated.  The 
processing throughput of the Cytocomputer™ is not affected by the number of objects in the
image(2); in fact, the throughput can equal the frame rate of the image acquisition sensor
if a full length pipeline is used.

An example of the Parts Acquisition application of the Cytocomputer™ is illustrated in
Figures  9-13.(4)     Figure  9  illustrates  a  bin  of  parts  in  which  the  parts  are  touching  and 
overlapping  one  another.     The  object  of  the  algorithm  is  to  find  the  parts  which  are  on  top 
and  therefore  available  for  a  robot  arm  to  pick  up.     The  concept  of  "on  top"  here  is  that 
all  feature  elements  of  what  make  up  a  connecting  rod  casting  are  visible  in  this 
grey-scale  image  and  therefore  it  is  not  underneath  another  part.     The  algorithm  proceeds 
as  follows.     The  grey-scale  image  is  regarded  as  a  three-dimensional  surface  where  contrast 
to  a  background  surround  or  another  object     is  an  "edge".     The  edges  of  objects  are 
highlighted  by  dilation  of  "higher"  objects  horizontally  and  differencing  to  find  the 
contrast  difference  edges.     These  edges  are  collected  according  to  orientation  and 
connectivity  by  analysis  with  small  directed  line  segment  reference  images.     The  results 
are  shown  in  Figure  10.     Next  the  areas  between  long  straight  parallel  line  segments  are 
located  by  dilation  and  erosions  for  width  and  length  to  identify  shafts.     Circular  probes 
identify  the  large  and  small  circular  and  elliptical  regions.     These  cues  are  shown  in 
Figure  11.     These  elements  of  what  constitutes  a  connecting  rod,   i.e.,   a  shaft  with  a  large 
and  a  small  circular  end,   are  then  tested  for  proper  relation.     This  is  accomplished  by 
testing  the  shaft  cue  to  identify  its  ends.     Dilating  circularly  followed  by  eroding  the 
circle  cues  to  points  and  dilating  circularly  to  intersect  defines  the  allowable  range  of 
relations   in  a  bit  plane  operation.     The  dilated  cues  and  their  intersection  are  shown  in 
Figure  12.     Figure  13  shows  a  cue  on  the  one  "most  visible"  connecting  rod. 

The  processing  of  images  for  Quality  Assurance  Inspection  applications  may  include 
conventional  grey-scale  television,   x-ray  television,   computer  generated  imagery  such  as 
integrated  circuit  mask  design,  or  3-D  imagery.     An  interesting  example  of  an  inspection 
task  which  combines  the  use  of  shape  recognition  and  3-D  relationships  is  illustrated  in 
Figures  14-18.     Figure  14  is  a  composite  from  a  larger  image  taken  with  ERIM's  3-D  tabletop 
laser  scanner  of  an  engine  head  assembly.     Grey-level  represents  relative  range  to  the 
pixel.     The  processing  objective  is  to  locate  the  valve  stems  and  spring  retainers  to  check 
for  proper  assembly.     An  improperly  assembled  unit  results  in  the  valve  stem  protruding  too 
far  from  the  retainer.     Figure  15  shows  the  results  of  a  small  non-linear  filter  to  remove 
the  single  and  small  pattern  noise.     Next  the  background  is  approximated  by  filtering  with 
a  sliding  disk  or  cylinder  from  below  the  image.     The  circular  surface  is  larger  than  the 
stem  end  and  returns  a  good  approximation  to  the  retainers,   shown  in  Figure  16. 
Subtracting  this  image  from  the  original  normalizes  all  retainers  to  the  same  level,  Figure 
17.     Now  a  slice  through  the  image  at  the  height  of  the  bad  stem  assemblies  will  identify 
all  regions  of  like  difference   (Fig  18).     A  lower  slice  may  be  analyzed  to  find  the  large 
circular  regions,   i.e.,   the  retainers,   and  a  bit  plane  logic  operation  cues  all  stems  in 
the  center  of  retainers  which  exceed  a  certain  height,   Figure  19.     This  analysis  proceeds 
everywhere in the image; therefore fixturing or windowing is not required.

Figure 4. Cytocomputer™ II

Figure 8. Background normalized text

Figure 9. Grey-level image

Figure 10. Edge image

Figure 13. Cue on original image

Figure 14. 3-D range image of

Grey-level filtering of salt and pepper noise

Figure 17. Background normalized image, retainers at same levels

Figure 19. Combination of 3-D level and retainer center features to mark 3 bad assemblies

Abstract

Fast  Fourier  transform  methods  are  attractive  techniques  for  implementing  high  speed 
programmable  filters,   especially  when  flexibility,   accuracy,   and  sharp  filter  transition 
regions are important considerations. This paper presents some design considerations in the
implementation of such filters. A new high capacity, high speed FFT architecture is presented
which is being incorporated in a nearly completed development model for a flexible
series  of  FFT  processors.     It  is  shown  how  this  processor  may  be  efficiently  embedded  in  an 
overall  programmable  filtering  structure. 

Introduction 

Use  of  the  fast  Fourier  transform   (FFT)   for  performing  digital  filtering  operations  is 
well established, with its first comprehensive study by Stockham.1 The vast majority of
applications have, however, been restricted to situations in which the filter to be implemented
is fixed, or at least need not be altered very often. In this paper we focus attention
on programmable type filters in which one might wish to change between filter types on
a  fairly  regular  basis  and  in  which  complicated,  yet  precisely  defined,   filter  transfer 
functions  are  to  be  implemented. 

Many requirements exist for such filters including adaptive equalizers, interference removal
("excision") circuitry, demodulation of frequency hopping spread spectrum type
communication signals, and multichannel FM modulation and demodulation. In this paper we
have in mind situations calling for real-time bandwidth processing capabilities of several
MHz or more and filters requiring impulse response lengths, and hence FFT sizes, on the
order of thousands to tens of thousands of samples. Restriction is made to one-dimensional
filtering,   although  almost  all  results  carry  over  to  filtering  in  two  or  more  dimensions. 

In the following sections we first consider how to design filters meeting specified criteria
that are practical to implement. Next we show an efficient system layout for performing
such programmable filtering. Finally we present a new FFT parallel architecture which
is  particularly  suitable  for  implementing  very  high  speed,   high  capacity  transforms  and 
which  is  currently  being  incorporated  in  a  developmental  model. 

Design  considerations  for  programmable  FFT  based  filters 

Use  of  the  FFT  for  digital  filtering  is  simply  an  efficient  method  for  doing  convolution 
with finite impulse response (FIR) filters. Not only do FFT techniques allow high speed
real-time operation, but they enable rapid filter specification and design in the frequency
domain. Problems associated with such frequency domain specification include: (1) specifying
the filter's transfer function in the frequency domain at discrete points in order to
minimize the errors resulting between samples, i.e. the so-called Gibbs phenomenon, (2)
specifying  frequency  weights  so  that  a  given  section  of  the  impulse  response  contains  a 
series  of  zeroes  in  order  to  avoid  periodic  convolution  problems,   and   (3)   developing  methods 
to  rapidly  determine  filter  weights  without  excessive  computation. 

There are three main methods for designing FIR filters: (1) windowing method, (2) frequency
sampling method, and (3) optimal equiripple method. Although the latter two methods
enable filter designs with narrower transition widths (passband to stopband) than the windowing
method, we show below why the windowing method is preferable in the application under
consideration,   namely  when  the  filter  consists  of  a  series  of  passbands  and  stopbands. 

The  advantages  of  the  windowing  method  over  the  other  methods  for  programmable  FFT  based 
filtering  include:      (1)    simplicity  of  design  via  frequency  domain  methods,    (2)  requirement 
for storage of only a small number of frequency weight coefficients while maintaining desired
passband and stopband characteristics, and (3) utilization of weights that are, for
the most part, independent of locations of passbands and stopbands and even the widths of
such bands (nontrivial weights are only at transition regions). These advantages are obtained
by using windows that have spectral sidelobes sufficiently low in amplitude. Such
sidelobe levels can be chosen to be somewhat below the quantization accuracy of the FFT
processor. However, some care must be taken to ensure that the window, which is usually
derived from a time-limited continuous time function, does not undergo sidelobe degradation
upon sampling due to aliasing. This can be done by choosing windows whose far off spectral
sidelobes fall off sufficiently fast, such as 60 dB per decade (18 dB per octave) as in the
hanning case. Alternatively, one can start with windows with spectral periodicity, i.e. those
with only discrete time representations, such as the Dolph-Chebyshev type.

The  procedure  that  we  are  suggesting  is  simply  convolution  in  the  frequency  domain  of  a 
desired  frequency  response   (passbands  and  stopbands)  with  the  central  lobe  of  the  spectral 
window.     Given  an  FFT  size  of  N  samples  we  may  wish  to  constrain  the  filter  impulse  response 
length  to  a  fraction  of  this,   say  N/2  or  N/4,   in  order  to  permit  aperiodic  convolution  via 
sectioning  methods  without  the  distortion  associated  with  periodic  convolution  via  the  FFT. 
The spectral window should reflect this time duration, but the frequency weights are evaluated
at intervals consistent with the N point FFT. For example, if we use a hanning
spectral window with time duration T and we wish this to be one-half the section length,
then we evaluate the convolution of the desired filter response with the hanning spectral
window at frequencies separated by 1/(2T) Hz.

Since  the  spectral  mainlobe  width  is  constrained,  we  need  only  utilize  filter  weights 
which  are  nontrivial   (i.e.  not  zero  or  unity)   in  the  regions  about  the  transition  points. 
Furthermore,   if  these  transition  points  are  constrained  to  lie  at  multiples  of  1/T,   it  is 
seen  that  these  nontrivial  weights  are  independent  of  the  location  of  such  transition 
points. Also, if the widths of both passbands and stopbands are greater than the mainlobe
of  the  spectral  window,   then  only  one  set  of  weights  need  be  stored  since  the  set  applies  to 
any  transition  point.     If  some  passbands  and  stopbands  are  required  which  are  narrower  than 
the  mainlobe  width,  but  transitions  are  constrained  to  be  at  multiples  of  1/T,   then  we  need 
to  store  additional  sets  of  weights.     The  number  of  such  sets  is  small  since  it  is  bounded 
by  twice  the  number  of  points  in  the  spectral  window's  mainlobe   (here  we  assume  that  both  a 
passband  and  an  adjacent  stopband  are  not  less  than  the  spectral  window's  mainlobe  width). 
For  stopband  rejection  of  about  80  dB  and  data  sections  with  50%  overlap,  one  typically 
needs  about  10  to  15  nontrivial  weights  at  a  filter  transition.     Hence,   this  same  small  set 
of  weights  can  apply  to  literally  thousands  of  different  filter  types. 
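
The design procedure can be sketched as follows. This is a rough illustration under several assumptions that are not the author's specification: the hanning window of duration T is used (its spectrum is proportional to 0.5 sinc(u) + 0.25 sinc(u-1) + 0.25 sinc(u+1) with u = fT, and its main lobe spans |u| < 2), the lobe is sampled at the 1/(2T) spacing of the example above, the 64-bin response and band edges are invented, and all function names are hypothetical. A hanning lobe gives fewer nontrivial weights than the 10 to 15 quoted for 80 dB rejection; the point of the sketch is only that the transition weights repeat at every band edge.

```python
import math


def sinc(x):
    """Normalized sinc, sin(pi x) / (pi x)."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def hann_mainlobe():
    """Central-lobe samples of the hanning spectral window at spacing 1/(2T)."""
    lobe = {}
    for k in range(-3, 4):              # u = k/2; the lobe is zero at u = +/-2
        u = k / 2
        lobe[k] = 0.5 * sinc(u) + 0.25 * sinc(u - 1) + 0.25 * sinc(u + 1)
    total = sum(lobe.values())
    return {k: v / total for k, v in lobe.items()}   # unit gain inside wide bands


def filter_weights(ideal, lobe):
    """Circular convolution of the ideal 0/1 response with the lobe samples."""
    n = len(ideal)
    return [sum(w * ideal[(i - k) % n] for k, w in lobe.items()) for i in range(n)]


# Ideal response over 64 FFT bins: two passbands with edges on even bins (multiples of 1/T).
ideal = [1.0 if (8 <= i < 20 or 40 <= i < 56) else 0.0 for i in range(64)]
weights = filter_weights(ideal, hann_mainlobe())

# Bins whose weight is neither zero nor unity cluster around the four band edges.
nontrivial = [i for i, w in enumerate(weights) if 1e-9 < w < 1 - 1e-9]
rising_a = weights[5:11]      # transition into the first passband
rising_b = weights[37:43]     # transition into the second passband
```

The two rising transitions carry the same six weights even though the passbands differ in position and width, so only one stored set is needed for this case.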

The  problem  with  utilizing  frequency  sampling  methods  in  the  present  application  is  that 
the  passband  ripple  is  usually  much  larger  than  the  stopband  ripple  and  is  not  a  control 
factor. Hence, one cannot set most inband weights to unity without suffering severe distortion.
Thus, one must carry a large set of weights and these will in general vary from one
filter bandwidth to the next. One can employ optimal equiripple designs to guarantee equal
passband and stopband ripple with both less than any desired value. This can lead to designs
allowing setting of most inband weights to unity. However, there is no guarantee that
the  transition  weights  will  be  constant,   independent  of  filter  bandwidth,   as  they  are  for 
the  window  case   (assuming  sufficiently  wide  passbands).     Furthermore,   computation  of  such 
transition  weights  is  nontrivial.     It  should  also  be  pointed  out  that  although  the  window 
method  produces  larger  transition  widths  than  other  methods,  the  increase  is  fairly  modest 
if  both  passband  and  stopband  ripple  are  equal.3 

Implementation 

Having  agreed  upon  a  method  of  filter  design  requiring  minimal  coefficient  storage,  we 
now  wish  to  look  at  implementation.     The  main  question,   aside  from  the  availability  of  a 
suitable  FFT  processor,   is  how  such  a  processor  may  be  embedded  efficiently  in  an  overall 
architecture incorporating segmenting. For definiteness, let us consider the case in which
we wish to utilize one FFT processor for both forward and inverse transforms and assume that
overlap-save sectioning is used1 with adjacent sections of data overlapped by one-half. We
assume that the FFT processor is capable of performing FFTs and doing input/output operations
simultaneously, perhaps through use of internal double buffers if necessary. Figure 1
shows  how  overlapped  segments  of  data  may  be  fed  to  the  FFT.     Figure  2  shows  an  overall 
architecture  in  which  the  FFT  processor  is  always  kept  busy.     Note  that  an  input  operation, 
an  output  operation,   and  an  FFT  are  always  being  simultaneously  done  and  that  there  are  two 
different  sets  of  these  triples  corresponding  to  even  and  odd  numbered  time  intervals.  Only 
half  the  data  exiting  the  inverse  FFT  is  retained  in  the  overlap-save  operation. 
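
The sectioning scheme itself (not the double-buffered hardware timing) can be sketched in software. The code below is a minimal model under assumptions made here: a textbook recursive radix-2 FFT stands in for the processor, the function names are hypothetical, and the filter is limited to N/2 + 1 taps so that the second half of every output block is free of periodic-convolution wraparound.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (length must be a power of 2); inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def overlap_save(signal, taps, n):
    """Filter `signal` with FIR `taps` using n-point FFTs and 50% overlapped sections."""
    half = n // 2
    if len(taps) > half + 1:
        raise ValueError("impulse response too long for 50% overlap")
    freq_weights = fft(list(taps) + [0.0] * (n - len(taps)))
    padded = [0.0] * half + list(signal)          # history for the first section
    padded += [0.0] * (-len(padded) % half)       # whole number of half-sections
    out = []
    for start in range(0, len(padded) - half, half):
        section = padded[start:start + n]         # overlaps the previous one by half
        product = [a * b for a, b in zip(fft(section), freq_weights)]
        block = fft(product, inverse=True)
        out.extend(v.real / n for v in block[half:])   # keep only the second half
    return out[:len(signal)]


signal = [float(k % 7) for k in range(20)]
taps = [0.25, 0.5, 0.25]
filtered = overlap_save(signal, taps, 8)
```

Each pass through the loop corresponds to one forward transform, one weighting, and one inverse transform; the retained half-blocks concatenate to the ordinary linear convolution of the input with the filter.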

Figure 1. Overlapping input buffer (50% overlap)

Figure 2. Architecture for FFT filtering (50% overlap)

A  new  parallel  FFT  architecture 

A new parallel FFT architecture has been developed4 which is particularly suitable for
the implementation of very high speed and large size transforms, but is nonetheless competitive
with other approaches even when large size is not essential. The architecture has the
merit  that  the  bulk  of  the  FFT  processor  is  constructed  from  a  set  of  identical  PC  cards, 
called  "processor  cards".     An  FFT  processor  consists  of  a  control  card,   an  I/O  card,   and  a 
set  of  processor  cards  of  number  1,   2,    4,   8,   or  16    (in  the  present  design).     Adding  more 
cards  increases  both  the  memory  capacity  of  the  overall  system  and  the  speed.     Thus  this 
processor  family  allows  the  user  to  choose  the  amount  of  parallelism  he  wishes  and  yet 
allows  for  future  upgrade. 

The  architecture  is  based  upon  a  parallel  decomposition  of  Singleton's  algorithm  related 
to several reported in the literature, but different in some fundamental respects.5,6,7 The
work  of  Pease     suggests  a  complete  parallel  decomposition  in  which  one  arithmetic  unit  is 
assigned  to  each  butterfly  in  a  given  iteration.     This  is  extremely  complex   (although  very 
high  speed)    for  dealing  with  large  size  arrays.     The  work  of  Corinthios     is  basically  a 
serial  implementation  of  Singleton's  algorithm  and  Krasner8   reports  a  pipelined  version. 
Only  in  Veenkant7  do  we  find  a  decomposition  in  which  a  small  number  of  arithmetic  units 
operate  in  parallel  on  each  iteration.     In  Veenkant7  the  data  was  sectioned  into  M  segments 
each  of  which  consisted  of  N/M  contiguous  samples    (N  point  FFT)    and  each  arithmetic  unit 
operated  independently  on  different  segments.     A  fixed  interconnect  between  butterfly  and 
memory  units  was  illustrated. 

The  current  architecture  is  different  in  that  the  data  is  split  into  M  interlaced  data 
sets  and  M  different  arithmetic  units  are  used  to  process  the  different  sets  of  data. 
Furthermore,   it  has  been  shown  how  memory  units  and  arithmetic  units  may  be  combined  on 
individual  identical  cards,   the  processor  cards,  which  are  connected  to  one  another  by  a 
simple pattern. The distinction between interlaced and contiguous data sets is significant
since it turns out that the number of interconnections required in the former is half that
of the latter. This is a critical design element. Furthermore, by dealing with interlaced
data it is a simple matter to initially input at high speed external data to the FFT processor
(basically a demultiplexing operation). That such a structure is possible certainly
seems  plausible,   but  requires  some  investigation  to  determine  that  it  is  in  fact  possible. 

The  author  and  a  colleague  were  led  to  this  structure  since  alternatives  led  to  systems 
requiring several different types of cards9 or utilized pipelined techniques.10 The latter
methods offered little flexibility in varying degree of parallelism and, furthermore, could
not  efficiently  allow  block  floating  point  arithmetic  for  accuracy,  without  introducing  a 
great  deal  of  extra  memory  requirement. 

In  radix-two  implementations  each  processor  card  has  two  input  ports  and  two  output 
ports. Each of the two output ports is connected to a given port on the same or another
card for all time. The interconnect is stated as follows: one output port on card i is
connected to an input port on card (2i) mod M and the second output port on card i is connected
to an input port on card (2i+1) mod M where i = 0,1,...,M-1, and M is the number of
cards (a power of 2). This together with proper routing of data on each card implements the
"perfect shuffle" required in Singleton's algorithm.11

Great emphasis has been placed on high accuracy, wide bandwidth, and flexibility of operation
of the processor. In the latter case the processor can perform transforms of varying
sizes, forward and inverse transforms, simultaneous FFT and I/O operations through cycle
stealing techniques, and can permit either fully ordered or partially ordered outputs (for
higher speed). Table 1 gives a partial list of overall specifications. The unit is nearing
completion of development at this time. A patent has been awarded for the architecture.14
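
The fixed card-to-card interconnect stated above (outputs of card i go to cards 2i mod M and 2i+1 mod M) is easy to tabulate. The sketch below only checks the wiring pattern; it does not model the on-card data routing. The function names are invented here, and assigning sample n to card n mod M is an assumed reading of "interlaced data sets".

```python
def interconnect(m):
    """Map each card i to the two cards its output ports feed."""
    if m < 1 or m & (m - 1):
        raise ValueError("the number of cards must be a power of 2")
    return {i: ((2 * i) % m, (2 * i + 1) % m) for i in range(m)}


def inputs_per_card(m):
    """Count how many output ports are wired to each card."""
    counts = [0] * m
    for targets in interconnect(m).values():
        for target in targets:
            counts[target] += 1
    return counts


def interlace(samples, m):
    """Demultiplex a sample stream so that card c receives samples c, c+m, c+2m, ..."""
    return [samples[c::m] for c in range(m)]


wiring = interconnect(4)
fan_in = inputs_per_card(16)
loaded = interlace(list(range(8)), 4)
```

Every card ends up with exactly two incoming connections, matching its two input ports, for any power-of-2 card count.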

Table 1. Some specifications for developmental FFT processor
(16 processor card version)

Item                                Specification

Transform Size                      64 to 131,072 complex points in powers of 2

Effective Butterfly Time            10.4 nsec (16 butterflies in parallel every 167 nsec)

System Clock                        167 nsec (6 MHz)

Time per FFT (without I/O)          72 µsec for 1024 complex points
                                    6.57 msec for 65536 complex points

Maximum Continuous Complex Sample   13.1 MHz for 1024 complex points
Rate (Realtime Bandwidth)           9.39 MHz for 65536 complex points

Arithmetic                          Block floating point, 16-bit mantissa (including sign)
                                    plus 5-bit block exponent, conditional array scaling

Size, Power                         22.75" High x 19" Wide x 24" Deep, 1000 Watts

Dynamic Range                       80 dB, minimum, peak tone to RMS noise floor

Other                               Automatic self-test and fault isolation
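
The effective butterfly time in Table 1 follows directly from the clock period and the number of cards, and a simple radix-two butterfly count gives an idealized lower bound on transform time. The sketch below is only that arithmetic; the (N/2) log2 N butterfly count assumes a plain radix-two transform, and the constant and function names are chosen here. The times quoted in the table are somewhat longer than this bound, and the text does not break down the difference.

```python
CLOCK_NS = 167.0      # system clock period from Table 1
CARDS = 16            # processor cards, one butterfly each per clock


def effective_butterfly_ns():
    """Clock period divided by the number of butterflies done in parallel."""
    return CLOCK_NS / CARDS


def butterflies(n):
    """Radix-two butterfly count, (n/2) * log2(n), for a power-of-2 size n."""
    return (n // 2) * (n.bit_length() - 1)


def ideal_fft_time_us(n):
    """Idealized transform time in microseconds, ignoring all overhead."""
    return butterflies(n) * effective_butterfly_ns() / 1000.0


small = ideal_fft_time_us(1024)      # compare with 72 usec in Table 1
large = ideal_fft_time_us(65536)     # compare with 6.57 msec in Table 1
```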

Abstract


Two  optical  adaptive  filter  designs  using  correlation  cancellation  loops  are  described. 
The  experimental  results  and  performance  evaluation  of  a  time-domain  implementation  are 
presented.     They  are  encouraging,   although  limited  by  equipment.     A  frequency-domain 
architecture  is  described,   and  its  projected  performance  is  compared  to  that  of  the  time- 
domain  implementation. 


Introduction 


Adaptive  filtering,   implemented  by  linear  prediction,    finds  application  in  several  areas, 
including  spectral  analysis,   system  modelling,   and  speech  encoding1,2,3.     Because  of  their 
simplicity,   size,   cost,   parallelism,   and  bandwidth,   optical  processors  are  more  advantageous 
for  some  applications  than  electrical  analog  or  digital  processors.     As  a  result,   there  is 
a  desire  for  the  development  of  an  optically  implemented  adaptive  filter. 

A  feasibility  study  of  an  optical  adaptive  filter  has  been  conducted.     An  alternate 
implementation  with  different  characteristics  has  been  designed.     They  have  a  common  basic 
building  block  called  a  correlation  cancellation  loop  (CCL)1. 

The CCL approach to linear prediction will be briefly described in the first section of
this paper. An optical adaptive filter time-domain implementation (OAF) using the CCL is
then described. Experimental results, observations, and conclusions of the feasibility
study performed with a processor of this design follow. As will be seen, the experimental
results were only marginally favorable, due primarily to a lack of appropriate equipment
rather than inherent architectural flaws. The paper then presents the architecture for an
optical adaptive filter frequency-domain implementation (OAFF1) using the CCL concept. The
advantages and disadvantages of the projected OAFF1 performance as compared to those of the
OAF will conclude the paper.


Linear  prediction  with  correlation  cancellation  loops 


In this application of linear prediction, a signal, x(t), with time-invariant
autocorrelation, is approximated at the current time by a linear combination of its equally
spaced past values, i.e., $\hat{x}(t)$, the approximation, is given by

$$\hat{x}(t) = \sum_{n=1}^{N} a_n \, x(t - nT). \qquad (1)$$

The input taps occur T seconds apart. The $a_n$ values are weighting factors for these input
samples, and must be determined. One method that produces optimum weights (on a least mean
square error basis) uses correlation cancellation loops. A diagram of the adaptive linear
predictor utilizing this method is shown in Figure 1. Each delayed input is correlated with
an error signal. The output of the nth correlator is the weighting factor $a_n$ for the
corresponding nth delay tap, i.e.,

$$a_n = \int e(t') \, x(t' - nT) \, dt'. \qquad (2)$$

These taps are multiplied by their respective weights, summed, and amplified to produce
$\hat{x}(t)$, a predicted approximation of x(t). The difference between the signals, x(t)
and $\hat{x}(t)$, produces the error signal, e(t), which is correlated with the delayed
taps. Thus, it is a closed loop feedback system. It can be shown that, when this system
reaches equilibrium, the tap weights for linear prediction, as in Equation (1), will be
determined. This verification is not presented here, but it may be found in
references 1, 4, and 5. A more mathematical treatment of linear prediction may be found in
references 2 and 3.
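
The closed loop described above can be illustrated numerically. The following is a minimal
discrete-time sketch, not the authors' implementation: it assumes that the correlator
integral of Equation (2) may be replaced by a running sum scaled by a loop gain (an
LMS-style update), and the function name, test signal, tap count, and gain are all
hypothetical choices made for illustration.

```python
import math

def ccl_predict(x, n_taps, gain):
    """Discrete-time sketch of a correlation cancellation loop predictor.

    Each weight accumulates the product of the error and its delayed input
    (a sampled stand-in for Equation (2)); the prediction is the weighted
    sum of the delayed inputs (Equation (1)).
    """
    weights = [0.0] * n_taps
    errors = []
    for k in range(len(x)):
        # Delayed taps x(t - nT), n = 1..N (zero before the signal starts).
        taps = [x[k - n] if k - n >= 0 else 0.0 for n in range(1, n_taps + 1)]
        prediction = sum(a * s for a, s in zip(weights, taps))
        error = x[k] - prediction
        # Running sum of e * x(t - nT), scaled by the (assumed) loop gain.
        weights = [a + gain * error * s for a, s in zip(weights, taps)]
        errors.append(error)
    return weights, errors

# Hypothetical test signal: a sinusoid sampled ten times per cycle.
signal = [math.sin(2 * math.pi * 0.1 * k) for k in range(2000)]
weights, errors = ccl_predict(signal, n_taps=2, gain=0.05)
early = max(abs(e) for e in errors[:50])
late = max(abs(e) for e in errors[-50:])
print("peak |e| over first 50 samples:", early)
print("peak |e| over last 50 samples: ", late)
print("tap weights:", weights)
```

With an ideal (non-leaking) accumulator, the error envelope of this sketch decays toward
zero, and the two weights approach the values that predict a sampled sinusoid exactly from
its two previous samples.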

Optical adaptive filter time-domain implementation


The optical adaptive filter architecture described here was suggested by Douglas E.
Brown. The tapped delay line is in the form of a Bragg cell (see Figure 2a), which allows
continuous rather than discrete tapping (references 6 and 7). The maximum effective number
of taps is given by the time-bandwidth product of the Bragg cell, which is the product of
the time aperture (the number of seconds of data that the Bragg cell can contain) and the
Bragg cell bandwidth. Light passing through a Bragg cell at position d will be intensity
modulated by x(t-d/v), where x(t) is the electrical input to the cell's transducer, and v is
the acoustic velocity in the Bragg cell material. Hence, in a Bragg cell of width D,
x(t-d/v) can be optically processed for all possible delays, $0 \le d/v \le D/v$,
simultaneously.

The error signal is introduced to the system through an electrooptic (EO) modulator. Light
passing through this device is intensity modulated by e(t). (See Figure 2b.) The light is
then collimated and spread horizontally by lenses which are not shown in the figure. After
passing through the Bragg cell, the light intensity is modulated by the product,
x(t-d/v) e(t). The correlation is not complete until this product is integrated. To do
this optically, a Hughes liquid crystal light valve (LCLV) is used. The light from the
Bragg cell is imaged onto one face of the LCLV. The integral of the product, x(t-d/v) e(t),
is read off the other face by a polarized, collimated read beam. The polarization of the
read beam is rotated by an amount proportional to the value of the integral at each
position on the light valve. Hence, the tap weight values are represented by the
polarization of light leaving the LCLV.

This light is passed through an analyzer, which converts polarization to intensity, and is
then imaged onto a second Bragg cell which, like the first, has input x(t). Light emerging
from the second Bragg cell carries, for all allowed values of d, the products
$a_d \, x(t-d/v)$. These weighted taps must now be summed to produce the approximation,
$\hat{x}(t)$. The summation is performed by a photomultiplier tube (PMT), which converts
light intensity to electrical output. This output is then amplified before entering a
difference amplifier where the subtraction, $x(t) - \hat{x}(t)$, is done electrically. The
output of the difference amplifier is therefore e(t), and is the electrical input to the
EO modulator, as mentioned before.

A block diagram of the closed loop system is shown in Figure 3. Figure 4 is a schematic of
the optical processor itself. For purposes of discussion, let "horizontal" and "vertical"
both be perpendicular to the optical axis, with horizontal parallel to the plane of the
paper and vertical perpendicular to it. Cylinder lens C1 causes the collimated input light
to converge vertically to a horizontal line, so that it can pass through the first Bragg
cell, BC1. Spherical lens S1 images this line onto the input face of the LCLV. Undiffracted
light of BC1 is blocked in the S1 focal plane, although this is not shown in the figure.
Cylinder lens C2 brings the collimated read beam to a horizontal line to read the tap
weights from the output face of the LCLV. After reflection from the LCLV, the light again
passes through C2, and is recollimated vertically to its original height. Spherical lenses
S2 and S3 have the same focal length, and are used to image the weights from the LCLV
output to the second Bragg cell, BC2, where the weights and taps are multiplied. An
analyzer is placed between lenses S2 and S3 at a point where the light has converged enough
to pass through it. Cylinder lens C3 images the weights back down to a horizontal line so
the light can enter BC2. S4 and C4 cause the light to converge enough to be collected by
the PMT. In the focal plane just before the PMT, undiffracted light of BC2 is removed.
Optimum spatial filtering for this plane has yet to be determined.

There are several basic differences between the operations performed by the optical
implementation and those shown in the original CCL diagram in Figure 1. These are discussed
in more detail in reference 5. One of the most notable differences is that light intensity,
being the square modulus of amplitude, can have only positive values. Another major
difference is that the LCLV does not perform a true integral from $-\infty$ to the present.
Instead, it performs a running integration over an effective finite time, T'. These effects
compensate for each other somewhat, although the limited integration period does prevent
the error signal from ever going to zero and remaining there. The exclusively positive
integral inputs would cause the integral to increase without end, if not for this effective
finite integration time. The finiteness allows the error signal to reach a nonzero
equilibrium value.
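
The effect of a finite integration time on the equilibrium error can be sketched
numerically. The example below is an illustrative model only: it assumes the running
integration of the LCLV can be approximated by a leaky accumulator (the `leak` parameter,
roughly the reciprocal of the effective integration time in samples, is a hypothetical
stand-in for T'), and it does not model the restriction of light intensity to positive
values. All names and numerical values are assumptions.

```python
import math

def leaky_ccl_residual(x, n_taps, gain, leak):
    """Peak steady-state error of a CCL predictor whose integrators leak.

    The leak models a finite effective integration time: old products of
    error and delayed input are gradually forgotten.
    """
    weights = [0.0] * n_taps
    errors = []
    for k in range(len(x)):
        taps = [x[k - n] if k - n >= 0 else 0.0 for n in range(1, n_taps + 1)]
        error = x[k] - sum(a * s for a, s in zip(weights, taps))
        # Leaky accumulation instead of a true integral from the distant past.
        weights = [(1.0 - leak) * a + gain * error * s
                   for a, s in zip(weights, taps)]
        errors.append(error)
    # Peak error magnitude after the loop has settled.
    return max(abs(e) for e in errors[-200:])

# Hypothetical sinusoidal input and three loop gains.
signal = [math.sin(2 * math.pi * 0.1 * k) for k in range(6000)]
gains = [0.01, 0.03, 0.1]
residuals = [leaky_ccl_residual(signal, 2, g, leak=0.001) for g in gains]
for g, r in zip(gains, residuals):
    print("gain", g, "-> residual peak error", r)
```

In this model the error settles at a nonzero value for every gain, and the residual becomes
smaller as the loop gain is raised.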

Time-domain  implementation  experimental  results 

The experimental results are most easily conveyed through photographs of the output
functions. One way to see how well the OAF works is to view the transient response of the
error signal, e(t). Its envelope should decrease from its initial value to some equilibrium
value as the processor adapts to the input signal, x(t). Another indicator of system
performance is the spectrum of the error signal. The open loop error signal spectrum is
used as a reference. When the loop is open, x(t) is fed into the correlator in place of
e(t). In the closed loop state the spectral components of e(t) should be reduced due to
the subtraction of a predicted signal, $\hat{x}(t)$, from x(t).

The transient response of e(t) for various loop gains is shown in Figure 5. The time
scale, which is horizontal, is compressed enough to reveal the behavior of the e(t)
envelope. The input signal is a single sinusoid at a frequency of approximately 3.6 MHz.
This signal is turned on and off at a frequency low enough to reveal the transient behavior
of the signal. The loop gain is increased in each successive picture. This is true for
all figures containing photographs presented here. It is easily seen that, as the gain
increases, the time taken for e(t) to converge to an equilibrium value decreases. The
value that is reached also decreases. (This residual error is due to the imperfect
integration and improves as the gain increases.)


The spectra corresponding to the signals in Figure 5 are shown in Figure 6. Figure 6a
is of the e(t) spectrum at open loop and provides the reference level of 0 dB. The positive
and negative frequency components of the sinusoidal input are shown at ±3.6 MHz. These
should be rejected in the closed loop state. The other frequency components appearing in
the spectrum occur because the x(t) generator used did not produce a pure sinusoid. These
components do not get rejected in the closed loop state because they are outside the
passband of the OAF, as will be discussed later. Figures 6b through 6e are for the same
input signal and loop gain values as Figures 5a through 5d, respectively. As the gain
increases, the signal rejection increases. Signal rejection of approximately 20 dB is
obtained as shown in Figure 6e.

In Figure 7, x(t) is now a pulse train and is shown on the top trace. The bottom trace
is $\hat{x}(t)$, the predicted signal. In each successive picture, Figures 7a through 7e,
$\hat{x}(t)$ looks more like x(t) due to the increased gain. One reason that the resulting
$\hat{x}(t)$ shown in Figure 7e is not a better approximation of x(t) is that not enough
spectral components of x(t) are within the system bandwidth to make a sharper pulse
reproduction possible. The pictures in Figure 8 are e(t) spectra for a pulse train x(t)
similar to the one depicted in Figure 7. Figure 8a shows the open loop spectrum. The rest
of Figure 8 is for the closed loop state. As the gain increases, all of the components are
rejected by some amount except the fifth one. Again, this is because it is outside the OAF
bandwidth. In Figure 8e, signal rejection of 10 dB is reached.


Time-domain  implementation  observations  and  conclusions 


There were a number of problems associated with the particular lab implementation
utilized here. Some were the results of individual component characteristics, while others
were difficulties inherent in the architecture. The imaging from the first Bragg cell to
the second must be one-to-one so that the tap weights line up exactly with the taps in
size. The horizontal position of the second Bragg cell must be adjusted after the correct
image size is obtained, so that the tap weights multiply their corresponding taps and not
other taps shifted by some amount. These two adjustments are perhaps the most difficult to
make in the physical construction of the processor, and are major drawbacks of this
architecture.


The  liquid  crystal  light  valve  was  designed  to  respond  to  imaging  light  centered  at  525 
nm  in  the  green  part  of  the  visible  spectrum.     Since  a  green  light  source  was  not  available, 
red  light  produced  by  a  HeNe  laser  was  used  on  the  input  side.     The  light  valve  barely 
functioned  with  this  red  input  light,   and  would  have  been  much  more  sensitive  to  a  green 
light input, even if it were of less power. In order to get an appreciable output, the
LCLV was driven at approximately 2.7 kHz rather than the suggested 10 kHz. This reduction
in driving frequency resulted in an increased integration period T', and thus a brighter
output. T' increases as the light valve driving frequency decreases, down to approximately
100 Hz. However, for frequencies below 1 kHz this is accompanied by a loss of resolution.
Here  it  should  also  be  pointed  out  that  the  input  light  striking  the  LCLV  was  brought  to  a 
line  in  order  to  increase  the  intensity  by  channeling  the  light  onto  a  smaller  area.  This 
was  necessary  to  achieve  adequate  response  from  the  light  valve.     A  krypton  ion  laser,  also 
a  red  light  source,  was  appropriate  for  use  on  the  output  side  of  the  LCLV.  Another 
complication  occurring  in  the  light  valve  operation  was  that  the  output  information  was 
modulated  by  the  LCLV  driving  frequency.     The  effects  were  seen  as  a  varying  envelope  in 
the  transient  response  pictures.     Note  that  the  signal  photographs  in  Figure  5  seemed 
slightly  blurred.     In  order  to  decrease  the  pictorial  effects  of  the  unwanted  LCLV 
modulation,   the  camera  shutter  was  opened  long  enough  to  record  a  time  average  of  the 
signal,   thus  blurring  it  slightly  but  enhancing  the  e(t)   envelope.     Although  this 
modulation  did  not  prevent  the  OAF  from  functioning  altogether,   it  probably  did  limit  the 
system  performance.     Had  a  green  light  source  been  available,   one  could  have  driven  the 
LCLV  at  a  higher  frequency,   thus  decreasing  this  modulation  effect  significantly. 

There  were  some  instabilities  in  the  system  which  were  probably  due  to  the  long  path 
length  of  the  light.     This  could  have  been  remedied  with  different  lenses.  Thermal 
currents  may  also  have  caused  the  beam  to  wander. 

The system functioned as a linear predictor over a bandwidth of approximately 5 MHz,
limited by the frequency variation of the phase difference between x(t) and $\hat{x}(t)$.
With perfect components and flawless alignment there would have been no such dependence.
The single component which frequency limited the system operation was the EO modulator
driver. It could function only up to 10 MHz. Thus, because of the processor's frequency
dependent phase difference, only half the available bandwidth of the OAF was utilized.

In view of the described difficulties with this laboratory implementation of an optical
adaptive filter, it is clear that many component alterations could dramatically improve
the system performance without changing the basic architecture. However, alternate
architectures may provide other benefits, and should be explored. One such architecture,
which has the advantage of substantially increasing the operation bandwidth, is described
below.

Optical  adaptive  filter  frequency-domain  implementation 

The  architecture  described  here  is,   again,   a  suggestion  of  Douglas  E.  Brown.     It  is  a 
frequency  domain  implementation  of  an  optical  adaptive  filter  using  the  CCL  method. 

If the approximation, $\hat{x}(t)$, allows a linear combination of continuously sampled
past values of x(t), rather than discretely sampled ones, then Equation (1), from linear
prediction theory,

$$\hat{x}(t) = \sum_{n=1}^{N} a_n \, x(t - nT) \qquad (1)$$

becomes

$$\hat{x}(t) = \int a(\tau) \, x(t - \tau) \, d\tau, \qquad (3)$$

where $\tau$ is a continuous variable replacing the discrete variable nT. Similarly,
Equation (2), from the CCL theory of linear prediction,

$$a_n = \int e(t') \, x(t' - nT) \, dt', \qquad (2)$$

becomes

$$a(\tau) = \int e(t') \, x(t' - \tau) \, dt'. \qquad (4)$$

When $a(\tau)$, as expressed in Equation (4), is substituted into Equation (3), the
expression for $\hat{x}(t)$ becomes

$$\hat{x}(t) = \int \left[ \int e(t') \, x(t' - \tau) \, dt' \right] x(t - \tau) \, d\tau. \qquad (5)$$

This is the convolution of x(t) with the cross-correlation of e(t) and x(t). From the
well-known properties of Fourier transforms, it is seen that the transform of Equation (5)
is

$$\hat{X}(\omega) = X(\omega) \left[ E(\omega) \, X^*(\omega) \right] = E(\omega) \, |X(\omega)|^2. \qquad (6)$$

Therefore, the inverse Fourier transform of $\hat{X}(\omega)$ is the inverse Fourier
transform of $\{E(\omega) \, |X(\omega)|^2\}$, i.e.,

$$\hat{x}(t) = \int E(\omega) \, |X(\omega)|^2 \, e^{j\omega t} \, d\omega. \qquad (7)$$

Note that in order to take the Fourier transforms above, it must be assumed that the
integrals are from $-\infty$ to $+\infty$, which, of course, is not the case. However, a
simple computer simulation of the frequency-domain implementation, including finite
integration limits, gave results which suggest the implementation would work.
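
The step from Equation (5) to Equation (6) can be checked numerically on a small example.
The sketch below is not the computer simulation mentioned above; it assumes a discrete,
circular (periodic) analogue of the integrals, real-valued random test sequences, and a
direct discrete Fourier transform written out by hand. All names are illustrative.

```python
import cmath
import random

def dft(v):
    """Direct discrete Fourier transform (O(n^2), adequate for a small check)."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * k * q / n) for k in range(n))
            for q in range(n)]

def idft(V):
    """Inverse of dft()."""
    n = len(V)
    return [sum(V[q] * cmath.exp(2j * cmath.pi * k * q / n) for q in range(n)) / n
            for k in range(n)]

random.seed(0)
N = 16
x = [random.uniform(-1, 1) for _ in range(N)]  # real input sequence
e = [random.uniform(-1, 1) for _ in range(N)]  # real error sequence

# Circular analogue of Equation (4): a[m] = sum_j e[j] x[j - m].
a = [sum(e[j] * x[(j - m) % N] for j in range(N)) for m in range(N)]
# Circular analogue of Equation (3): xhat[k] = sum_m a[m] x[k - m].
xhat_time = [sum(a[m] * x[(k - m) % N] for m in range(N)) for k in range(N)]

# Analogue of Equations (6) and (7): inverse transform of E |X|^2.
X, E = dft(x), dft(e)
xhat_freq = idft([E[q] * abs(X[q]) ** 2 for q in range(N)])

max_diff = max(abs(xhat_time[k] - xhat_freq[k]) for k in range(N))
print("largest difference between the two computations:", max_diff)
```

The two computations agree to within floating-point rounding, which is the discrete
counterpart of the equality stated in Equation (6).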

It will be shown that the integral in Equation (7) may be approximated through the use of
the frequency-domain architecture shown in Figure 9. x(t) drives Bragg cell BC1, and has a
Doppler shifted Fourier transform, $X(\omega) \, e^{j\omega t}$, incident on the LCLV
(transform and imaging lenses are not shown). The LCLV detects the squared modulus of the
input light amplitude, and performs a running integration. The signal power spectrum is
assumed to vary slowly relative to the integration time of the LCLV in this application,
and the integration effects may therefore be ignored. Because of this, the polarization
rotation undergone by the read beam may be considered to be approximately proportional to
the squared modulus of the input light amplitude. This LCLV output is passed through an
analyzer (not shown) and imaged onto a detector. e(t) drives BC2, and has its Doppler
shifted transform, $E(\omega) \, e^{j\omega t}$, also incident on the detector by means of
a beam splitter. Thus, the input to the detector is

$$|X(\omega)|^2 + E(\omega) \, e^{j\omega t}.$$

The detector sees the integral across frequency (since $\omega$ is the spatial variable at
the detector) of the squared modulus of the input, which is

$$\int \left| \, |X(\omega)|^2 + E(\omega) \, e^{j\omega t} \right|^2 d\omega.$$

Expanded, this becomes

$$\int \left\{ |E(\omega)|^2 + |X(\omega)|^4 + E^*(\omega) \, |X(\omega)|^2 \, e^{-j\omega t} + E(\omega) \, |X(\omega)|^2 \, e^{j\omega t} \right\} d\omega.$$

At equilibrium, the integral of the first two terms is essentially constant, and may be
ignored. Since e(t) and x(t) are real, each of the last two terms gives the same result
upon integration, specifically,

$$\int E(\omega) \, |X(\omega)|^2 \, e^{j\omega t} \, d\omega.$$

Therefore, according to Equation (7), the detector output is within a factor of two of
being the inverse Fourier transform of $\hat{X}(\omega)$, which is $\hat{x}(t)$. This
$\hat{x}(t)$ is subtracted from x(t) electrically to produce the drive for BC2, and thus
closes the loop. (Note that the detector output passes through an amplifier with variable
gain before the subtraction takes place, but this is not shown in Figure 9.)
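
The expansion of the detector output can also be checked numerically. The sketch below is
illustrative only: it assumes the integral over frequency may be replaced by a sum over
discrete Fourier frequencies, evaluates the output at integer sample times, and uses random
real test sequences. The function and variable names are hypothetical.

```python
import cmath
import random

def dft(v):
    """Direct discrete Fourier transform (O(n^2), adequate for a small check)."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * k * q / n) for k in range(n))
            for q in range(n)]

random.seed(1)
N = 16
x = [random.uniform(-1, 1) for _ in range(N)]  # real input sequence
e = [random.uniform(-1, 1) for _ in range(N)]  # real error sequence
X, E = dft(x), dft(e)

def phase(q, k):
    """Discrete counterpart of exp(j * omega * t)."""
    return cmath.exp(2j * cmath.pi * q * k / N)

def detector(k):
    """Sum over frequency of | |X|^2 + E exp(j omega t) |^2."""
    return sum(abs(abs(X[q]) ** 2 + E[q] * phase(q, k)) ** 2 for q in range(N))

def cross_term(k):
    """Sum over frequency of E |X|^2 exp(j omega t)."""
    return sum(E[q] * abs(X[q]) ** 2 * phase(q, k) for q in range(N))

# Constant (time-independent) part: sum of |E|^2 + |X|^4.
bias = sum(abs(E[q]) ** 2 + abs(X[q]) ** 4 for q in range(N))

max_imag = 0.0  # the cross term should be real for real e and x
max_err = 0.0   # detector output should equal bias + 2 * cross term
for k in range(N):
    c = cross_term(k)
    max_imag = max(max_imag, abs(c.imag))
    max_err = max(max_err, abs(detector(k) - (bias + 2.0 * c.real)))
print(max_imag, max_err)
```

In this discrete model the time-varying part of the detector output is exactly twice the
cross term, consistent with the factor of two noted in the text.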

Comparison  of  architectures 

The time-domain implementation requires that the Bragg cells be operated linearly in
intensity. This requirement limits the bandwidth capability to about 100 MHz. A
frequency-domain implementation that allows Bragg cell operation to be linear with
amplitude may utilize the full extent of the bandwidth available with current Bragg cells
(approximately 1 GHz). The frequency-domain implementation described here (OAFF1) requires
one less modulator than the time-domain implementation (OAF), in that the EO modulator is
not used, nor is any modulating device substituted for it. The OAFF1 architecture is
interferometric and requires precise registration of $|X(\omega)|^2$ and
$E(\omega) \, e^{j\omega t}$. Thus, it probably will be no easier to align than OAF. Other
optical to optical transducers may be substituted for the LCLV as they become available.
Shot noise due to the detector output terms

$$\int \left\{ |E(\omega)|^2 + |X(\omega)|^4 \right\} d\omega$$

will reduce the S/N at the output.

Conclusions 

It  is  encouraging  that  an  optical  feedback  system  such  as  OAF  has  produced  as  many 
positive  results  as  it  has.     An  optical  adaptive  filter  does  seem  feasible.     The  OAFF1 
described  here  could  permit  operation  at  a  much  greater  bandwidth  than  the  OAF.  Other 
architectures,   yet  to  be  considered,   may  provide  additional  improvements  and  will  be 
actively  pursued. 

Acknowledgments 

The  authors  wish  to  thank  Jonathan  D.   Cohen  for  consultations,   editing,   and  laboratory 
assistance,   Christine  Gainer  for  the  computer  simulation  of  OAFF1,   and  Joanne  Lantz  for  the 
preparation  of  the  manuscript. 


144 / SPIE Vol. 341 Real Time Signal Processing V (1982)


References 


1. Morgan, Dennis R. and Samuel E. Craig, "Real-Time Adaptive Prediction Using the Least
Mean Square Gradient Algorithm," IEEE Trans. Acoust., Speech, Signal Processing, Vol.
ASSP-24, pp. 494-507. Dec. 1976.

2. Makhoul, J., "Linear Prediction: A Tutorial Review," Proc. IEEE, Vol. 63, pp. 561-580.
Apr. 1975.

3. Makhoul, J., "Spectral Linear Prediction: Properties and Applications," IEEE Trans.
Acoust., Speech, Signal Processing, Vol. ASSP-23, pp. 283-296. June 1975.

4. Lucky, R.W., "Adaptive Redundancy Removal in Data Transmission," Bell Systems Tech. J.,
pp. 549-573. Apr. 1968.

5. Rhodes, J.F., "Adaptive Filtering with a Time-Domain Implementation Utilizing
Correlation Cancellation Loops," To be published.

6. Korpel, Adrianus, "Acousto-Optics - A Review of Fundamentals," Proc. IEEE, Vol. 69,
pp. 48-53. Jan. 1981.

7. Rhodes, William T., "Acousto-Optic Signal Processing: Convolution and Correlation,"
Proc. IEEE, Vol. 69, pp. 65-79. Jan. 1981.


Figure 1. Adaptive linear predictor constructed with correlation cancellation loops.

Figure 2b (right). Modulation of input light intensity by both an electrooptic modulator
and a Bragg cell, resulting in a total light modulation proportional to the product of the
two separate modulations.

Figure 3 (left). Block diagram of the optical adaptive filter.

Figure 4. Schematic of the optical portion of the optical adaptive filter.
Abbreviations are:

f: focal length
C: cylindrical lens
S: spherical lens
POL: polarizer
COL: collimator
BC: Bragg cell
EO: electrooptic modulator
PMT: photomultiplier tube
LCLV: liquid crystal light valve

Figure 5. Error signal transient response for a 3.6 MHz sinusoidal input, x(t).
Vertical scale: 100 mV/div., Horizontal scale: 500 µs/div.
Relative gain values: (a) A, (b) (5/2)A, (c) 5A, (d) 10A.

Figure 6. Error signal spectrum for a 3.6 MHz sinusoidal input, x(t).
Vertical scale: 10 dB/div., Horizontal scale: 2 MHz/div.
Relative gain values: (a) open loop, (b) A, (c) (5/2)A, (d) 5A, (e) 10A.

Figure 7. Top trace: x(t), Bottom trace: $\hat{x}(t)$, for an impulse train input, x(t).
Top vertical scale: 0.2 V/div., Bottom vertical scale: 0.1 V/div.,
Horizontal scale: 0.5 µs/div.
Relative gain values: (a) B, (b) 2B, (c) 5B, (d) 10B, (e) 20B.

Figure 8. Error signal spectrum for an impulse train input, x(t).
Vertical scale: 10 dB/div., Horizontal scale: 2 MHz/div.
Relative gain values: (a) open loop, (b) B, (c) 2B, (d) 5B, (e) 10B.

Figure 9. Diagram of optical adaptive filter frequency-domain implementation.


Abstract

An optical ambiguity processor is described which allows realtime processing of wideband
signals. One-dimensional acousto-optic cells are used as input transducers and no
light-to-light modulators are required. Performance is predicted and measured. The material
in this paper is distilled from the author's master's thesis of spring 1980 (reference 1).

Introduction 

Many signal processing problems require the realtime generation of an ambiguity surface
(reference 2), that is, the evaluation of

$$A(\omega,\tau) = \int f(t) \, g^*(t - \tau) \, e^{-j\omega t} \, dt \qquad (1)$$

for a rectangle in the $(\omega,\tau)$ plane. The ease with which Fourier transforms and
correlations are performed by optics (references 3, 4, and 5) suggests an optical solution.
This was recognized a decade ago (reference 6), and the intervening years have seen many
optical ambiguity processor architectures.
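
For orientation, the quantity in Equation (1) can be evaluated digitally on a small
example. The following sketch is not part of the processor described in this paper; it
assumes a discrete, circular analogue of Equation (1), and the test signal, delay, and
frequency offset are hypothetical values chosen so that the location of the peak is known
in advance.

```python
import cmath
import random

def cross_ambiguity(f, g):
    """Discrete, circular analogue of Equation (1).

    surface[q][m] = sum_k f[k] * conj(g[k - m]) * exp(-j 2 pi q k / n),
    with q the frequency index and m the delay index.
    """
    n = len(f)
    return [[sum(f[k] * g[(k - m) % n].conjugate()
                 * cmath.exp(-2j * cmath.pi * q * k / n) for k in range(n))
             for m in range(n)]
            for q in range(n)]

random.seed(2)
N = 32
# Unit-modulus test signal with random phase (a hypothetical wideband signal).
f = [cmath.exp(2j * cmath.pi * random.random()) for _ in range(N)]

# g is a shifted, frequency-offset copy of f, constructed so that the
# ambiguity peak falls at delay index `delay` and frequency index `doppler`.
delay, doppler = 5, 3
g = [f[(k + delay) % N] * cmath.exp(-2j * cmath.pi * doppler * (k + delay) / N)
     for k in range(N)]

surface = cross_ambiguity(f, g)
peak, peak_q, peak_m = max((abs(surface[q][m]), q, m)
                           for q in range(N) for m in range(N))
print("peak magnitude", peak, "at frequency index", peak_q, "delay index", peak_m)
```

The magnitude of the surface peaks at the frequency and delay indices built into the test
pair, which is the behavior an ambiguity processor exploits to estimate relative delay and
frequency offset between two signals.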

Time-integrating approaches (references 5 and 7), which make use of a chirp transform
algorithm, produce ambiguity surfaces with an $\omega$ range which is typically much
smaller than the input bandwidth, and have a correspondingly finer frequency resolution.
When the application requires an output frequency span commensurate with the input
bandwidth, a space-integrating architecture is appropriate. This is the case of interest
here.

In 1973, Said and Cooper (reference 6) demonstrated a realtime ambiguity processor which
employed a two-dimensional acousto-optic water cell as an input transducer. The cell was
driven by the inputs f(t) and g(t), which produced wide nonparallel beams which crossed as
they traversed the cell. A diffraction term proportional to f(t-x) g(t-y) resulted, where
x and y were spatial coordinates. A lens, transforming along the diagonal, produced the
desired crossambiguity function.

The  Said  and  Cooper  architecture  suffered  from  the  requirement  that  each  acoustic  beam 
needed  to  be  wide  enough  to  accommodate  the  useable  length  of  the  other.     This  is 
impractical  for  reasonable  time-bandwidth  products,   since  high  capacitance  of  the 
necessarily  large  transducers  limits  bandwidth.     Other  two-dimensional  input  transducers 
prohibit  realtime  operation  at  all  but  low  signal  bandwidths.     An  obvious  modification  is 
to  replace  the  two-dimensional  acousto-optic  cell  by  two  one-dimensional  ones  imaged  upon 
each  other,   and  oriented  at  right  angles.     Such  an  architecture  again  requires  the  width  of 
one  cell  to  exceed  the  length  of  the  other,   and  is  impractical. 

This paper presents an architecture which uses one-dimensional acousto-optic (Bragg)
cells to produce the realtime crossambiguity function of two signals in a manner similar to
the architecture of Said and Cooper. The signals may be of large bandwidth, good
time-bandwidth products are achieved, no optical-to-optical transducers are required, and
the full $\omega$ axis occurs at the output.

Architecture 

The processor may, for purposes of discussion, be sectioned into two stages. The first
stage is diagrammed in Figure 1. Light passes through two orthogonally arranged Bragg
cells to reach an "image" plane. Signals f and g drive the cells. Light undiffracted by
the cells is blocked, so that the amplitude seen in the image plane is related to the
product of f and g. Cell AO1 is imaged via lenses L2 and L3 (top view) so that the x'
(horizontal) position in the image plane is illuminated by light emerging from a unique
point (x) in cell AO1. Thus, horizontal position in the plane corresponds to the delay
applied to f(t). Lenses L3 and L4 image AO2 onto the image plane along the y axis, so that
vertical position y' in the plane determines the delay applied to g(t). Note that the
transform (rather than an image) of AO1 coincides with AO2.
The L3, L4 lens system results in a curved wavefront in the y' direction. For reasons
which will be clear later, L5 is inserted to remove the curvature.


A  diagram  of  the  image  plane    (less  L5)    is  shown  in  Figure  2.     From  the  preceding 
discussion,   the  light  amplitude  at  position    (x',   y')    is  f(t-x'/v)    g*(t-y'/v),   where  v  is 
the  Bragg  cell  acoustic  velocity*. 

It is now convenient to choose axes rotated 45 degrees with respect to x' and y'. As in
Figure 2, let $\tau' = (y' - x')/(\sqrt{2}\,v)$ and $t' = (y' + x')/(\sqrt{2}\,v)$. The t'
axis represents running time, while the $\tau'$ axis is differential time between the two
inputs. The image plane may be masked as shown in the figure to make the extents of the t'
and $\tau'$ axes uniform. In terms of the new coordinates, light passing through the plane
has amplitude

$$P(t',\tau') = f\left[t - (t' - \tau')/\sqrt{2}\right] \, g^*\left[t - (t' + \tau')/\sqrt{2}\right]. \qquad (2)$$
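
The change of coordinates can be checked with a few lines of arithmetic. The sketch below
is illustrative only: the acoustic velocity and the image-plane position are hypothetical
values, and the function names are not taken from the paper. It confirms that the delays
written in terms of the rotated coordinates reduce to x'/v for f and y'/v for g.

```python
import math

def rotate_coords(x_pos, y_pos, v):
    """Map image-plane position (x', y') to the rotated time axes (t', tau')."""
    t_prime = (y_pos + x_pos) / (math.sqrt(2.0) * v)
    tau_prime = (y_pos - x_pos) / (math.sqrt(2.0) * v)
    return t_prime, tau_prime

def delays_from_rotated(t_prime, tau_prime):
    """Delays applied to f and g, written in the rotated coordinates."""
    f_delay = (t_prime - tau_prime) / math.sqrt(2.0)
    g_delay = (t_prime + tau_prime) / math.sqrt(2.0)
    return f_delay, g_delay

v = 4.2e3                       # hypothetical acoustic velocity, m/s
x_pos, y_pos = 1.0e-3, 3.0e-3   # hypothetical image-plane position, m

t_prime, tau_prime = rotate_coords(x_pos, y_pos, v)
f_delay, g_delay = delays_from_rotated(t_prime, tau_prime)
print("f delay:", f_delay, "expected x'/v:", x_pos / v)
print("g delay:", g_delay, "expected y'/v:", y_pos / v)
```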

The "second stage" in the process is diagrammed in Figure 3, and corresponds to the
approach taken in the Said and Cooper processor. Spherical lens L6 transforms the image
plane distribution, with the result falling on a two-dimensional detector array. Cylinder
L7 is placed one focal length in front of the detector and oriented with its power parallel
to the $\tau'$ axis. The result is that the $\tau'$ axis is reconstructed on the detector to
form the $\tau$ axis, while the Fourier transform of the $t'$ axis is retained. The transform
coordinate is $\omega$. It is because the transformation and imaging are along axes rotated
with respect to the original x' and y' that the wavefront curvature (along y' only) must be
corrected prior to transformation.

For convenience, take the output scaling to be such that $\tau = \sqrt{2}\,\tau'$. The Bragg
cells have a time aperture $T_B$, so that the $t'$ axis has a length of $T_1 = T_B/\sqrt{2}$.
The light amplitude at the detector at time t and position $(\tau,\omega)$ is

$$A'(\tau,\omega;t) = \int_{I(\tau,t)} f(\lambda)\, g^{*}(\lambda-\tau)\, e^{-j\omega\lambda}\, d\lambda \tag{3}$$

where the integration interval may be taken to be

$$I(\tau,t) = [\,t + \tau/2 + T_1/2,\; t + \tau/2 - T_1/2\,]. \tag{4}$$

Comparison of (1) and (3) shows $A'(\tau,\omega;t)$ to be the desired ambiguity function, with
the exception of the finite integration interval.

The detector array integrates $|A'(\tau,\omega;t)|^2$ over a period T, where device
limitations dictate that $T \gg T_1$. For each $(\tau,\omega)$, the processor's output is then

$$C(\tau,\omega) = \int_{T} \left| \int_{I(\tau,t)} f(\lambda)\, g^{*}(\lambda-\tau)\, e^{-j\omega\lambda}\, d\lambda \right|^{2} dt. \tag{5}$$

Since $|\tau| \ll T$, the approximation

$$C(\tau,\omega) \approx \int_{T} \left| \int_{I(0,t)} f(\lambda)\, g^{*}(\lambda-\tau)\, e^{-j\omega\lambda}\, d\lambda \right|^{2} dt \tag{6}$$

is appropriate. The processor produces a time "average" over the period T of the square
modulus of the (truncated) ambiguity function.
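
As a rough numerical illustration of (3) and (6), the sketch below evaluates the squared modulus of a discrete, truncated ambiguity sum on a small grid of delays and frequencies. It is a software analogue only, not a model of the optics: the sampling, the function name `ambiguity_sq`, and the test sequence are assumptions made for illustration, and the detector's averaging over T is omitted.

```python
import cmath


def ambiguity_sq(f, g, delays, freqs):
    """Return {(k, w): |sum_n f[n] * conj(g[n - k]) * exp(-j*w*n)|**2}.

    k is a delay in samples and w a frequency in radians per sample.
    Samples of g outside its support are treated as zero, which plays
    the role of the finite integration interval.
    """
    surface = {}
    for k in delays:
        for w in freqs:
            acc = 0j
            for n, f_n in enumerate(f):
                m = n - k
                if 0 <= m < len(g):
                    acc += f_n * complex(g[m]).conjugate() * cmath.exp(-1j * w * n)
            surface[(k, w)] = abs(acc) ** 2
    return surface


# Autoambiguity (f = g) of a short, arbitrary +/-1 sequence (hypothetical data).
code = [1, 1, 1, -1, -1, 1, -1, 1, 1, -1, -1, -1, 1, -1, 1, 1]
surface = ambiguity_sq(code, code, delays=range(-3, 4), freqs=[-0.5, 0.0, 0.5])
peak = max(surface, key=surface.get)
print("peak at (delay, frequency) =", peak)
```

For an autoambiguity surface the largest value falls at zero delay and zero frequency, where every term of the sum adds in phase.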

Characteristics 

An ambiguity processor is most importantly characterized by the range and resolution of
its $\tau$ and $\omega$ axes. Since the $t'$ and $\tau'$ axes are truncated to a length
$T_1 = T_B/\sqrt{2}$, the $\tau$ axis has a range of

$$\Delta\tau = \sqrt{2}\,T_1 = T_B \tag{7}$$

while the $\omega$ axis has a resolution of

$$\delta\omega = 2\pi/T_1 = 2\pi\sqrt{2}/T_B \tag{8}$$

as determined by the reciprocal relation of time and frequency. The $\tau$ resolution and
$\omega$ range are completely determined by the requirement that the signals f and g be
restricted to the Bragg cell bandwidth B. Thus, the $\omega$ range is

$$\Delta\omega = 4\pi B \tag{9}$$

and the $\tau$ resolution

$$\delta\tau \approx B^{-1}. \tag{10}$$

Thus, the processor resolves $T_B B$ time delays and $\sqrt{2}\,T_B B$ frequency differences.

* The complex conjugate arises from choosing the negative diffraction order.
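
The relations (7) through (10) can be checked numerically with a few lines of code. The sketch below is illustrative only: the function name `processor_axes` and the dictionary keys are assumptions, and the sample values (a 30 µs aperture and a 20 MHz bandwidth) are those of the cells used in the experiment reported later in the paper.

```python
import math


def processor_axes(t_b, bandwidth):
    """Range and resolution of the delay and frequency axes, per (7)-(10).

    t_b       -- Bragg cell time aperture T_B in seconds
    bandwidth -- Bragg cell bandwidth B in hertz
    """
    t_1 = t_b / math.sqrt(2)                 # truncated axis length T_1
    delay_range = math.sqrt(2) * t_1         # (7): equals T_B
    freq_resolution = 2 * math.pi / t_1      # (8): rad/s
    freq_range = 4 * math.pi * bandwidth     # (9): rad/s
    delay_resolution = 1.0 / bandwidth       # (10): seconds
    return {
        "delay_range": delay_range,
        "delay_resolution": delay_resolution,
        "freq_range": freq_range,
        "freq_resolution": freq_resolution,
        "resolved_delays": delay_range / delay_resolution,
        "resolved_frequencies": freq_range / freq_resolution,
    }


axes = processor_axes(t_b=30e-6, bandwidth=20e6)
for name, value in axes.items():
    print(f"{name}: {value:.6g}")
```

With these inputs the ratio of range to resolution comes out as $T_B B = 600$ along the delay axis and $\sqrt{2}\,T_B B$ along the frequency axis, matching the counts stated in the text.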

A quick examination of output signal-to-noise ratio (SNR) is in order. Consider the
situation where f(t) is a known signal and g(t) is a received replica of f(t) with unknown
time delay $\tau_0$, carrier shift $\omega_0$, phase, and amplitude and is corrupted by
additive white Gaussian noise. The noise is assumed to be a stationary circular process.
For ease of computation, only the special case of $|f(t)|^2$ constant will be considered.
Define the input SNR $P_i$ to be the ratio of received signal power to received noise power
in the band of width B admitted by the Bragg cells. Let $P_o$ be the output SNR at the point
of maximum correlation, defined by

$$P_o = E^{2}\{C(\tau_0,\omega_0)\}\big|_{\text{no noise}} \,/\, \mathrm{Var}\{C(\tau_0,\omega_0)\}. \tag{11}$$

After arduous computation one reaches the approximation

$$P_o = P_i^{2}\, G_1^{2}\, G_2\, [1 + c\,G_1 P_i]^{-1} \tag{12}$$

where c stands for a numerical coefficient that is not legible in this copy,

$$G_1 = T_1 B = T_B B/\sqrt{2} \tag{13}$$

and

$$G_2 = 3T/2T_1. \tag{14}$$

$G_1$ may be interpreted as the processing gain contributed by the lens's spatial
integration, and $G_2$ the gain due to detector integration.
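
To give a feel for the size of the two gains, the sketch below evaluates (13) and (14) only; it deliberately leaves out (12). The function name `processing_gains` is an assumption, and the inputs (30 µs aperture, 20 MHz bandwidth, 33 ms detector integration) are the values quoted for the experimental processor later in the paper.

```python
import math


def processing_gains(t_b, bandwidth, t_int):
    """Return (G_1, G_2) from (13) and (14).

    t_b       -- Bragg cell time aperture T_B in seconds
    bandwidth -- Bragg cell bandwidth B in hertz
    t_int     -- detector integration period T in seconds
    """
    t_1 = t_b / math.sqrt(2)
    g_1 = t_1 * bandwidth              # (13): gain from spatial integration
    g_2 = 3.0 * t_int / (2.0 * t_1)    # (14): gain from detector integration
    return g_1, g_2


g_1, g_2 = processing_gains(t_b=30e-6, bandwidth=20e6, t_int=33e-3)
print(f"G_1 = {g_1:.1f} ({10 * math.log10(g_1):.1f} dB)")
print(f"G_2 = {g_2:.1f} ({10 * math.log10(g_2):.1f} dB)")
```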

Expression (12) ignores the shot noise contribution to variance. The primary effect of
shot noise here is that the output SNR is bounded by $N_e$, the maximum number of electrons
which may be held in a single detector pixel.

As the above discussion shows, the Bragg cell time-bandwidth product $T_B B$ is fundamental
in determining the processor's characteristics. Typically, $T_B B = 1000$ and rarely exceeds
this value by more than a factor of 2. Commercial Bragg cells are available with bandwidths
ranging from 10 MHz to 500 MHz and with time apertures up to 50 µs, subject to the
time-bandwidth product constraint.

Since two-dimensional detector array development has been driven by the video industry,
detector integration time rarely deviates from standard video frame periods, and then not
by very much. The pixel capacity $N_e$ of two-dimensional detectors is currently limited to
approximately $2 \times 10^5$ electrons. Finally, the number of pixels ultimately determines
how many $(\tau,\omega)$ points may be examined and is presently limited to about
$2 \times 10^5$.

Experimental  Results 

An ambiguity processor was assembled using two 30 µs slow shear wave Bragg cells. Their
20 MHz bandwidth offered a time-bandwidth product of 600. A silicon target vidicon with an
integration period of 33 ms acted as the detector array. Interchangeable lenses in the
"second stage" optics allowed a portion of the $\tau$ axis to be expanded, filling the
detector.

The test setup included one of several signal generators which acted as the reference
signal f(t) and had another output which could be delayed and shifted in frequency to
simulate a received signal g(t). Bandlimited white noise could be added to the "received"
signal to achieve any SNR. As well as a video monitor, a scope was connected to the
detector output to measure output SNR.

Figure 4 is a photograph taken from the video monitor showing the classic autoambiguity
function (f(t) = g(t)) of a pulse train. Here, the pulse duration was 1 µs with repetition
period 2.5 µs. This clearly demonstrated the processor was functional.

The  pulse  generator  was  replaced  by  a  pseudo-noise  sequence  generator,  which  produced 
one  narrow  peak  in  the  output.     The  peak  could  be  moved  by  changing  the  time  and  frequency 
shifts.     Noise  was  introduced  with  the   "received"  signal  at  various  power  levels  and  with 
two  bandwidths.     For  each  input  SNR,   the  scope  was  used  to  take  a  large  number  of 
measurements  of  the  correlation  peak.     Sample  mean  and  variance  were  used  to  estimate 
output  signal  and  noise  power.     Figure  5  compares  estimated  and  predicted  SNR, 
demonstrating  excellent  agreement. 


Conclusion 


The  processor  described  here  provides  a  solution  to  realtime  estimation  of  time  and 
frequency  of  arrival  of  wideband  signals  when  a  frequency  span  on  the  order  of  bandwidth 
is  required.     Experiments  have  demonstrated  the  feasibility  of  the  concept  and  support 
performance  predictions. 
Figure 1. Generating the image plane. (Undiffracted light not shown.)

Figure 2. The image plane. The t' axis is running time while the τ' axis is differential
time. The shaded area is masked so that the τ' and t' axes have uniform length.

Figure 3. Two views of the second stage. The τ' axis is imaged while transformation
occurs along the t' axis.

Figure 4. [Autoambiguity function of a pulse train.] Pulse width 1 µs, repetition period
2.5 µs. ω axis is horizontal.

Figure 5. Predicted and measured values of output SNR P_o as a function of input SNR P_i.
(Horizontal axis: input SNR in dB, from -30 to 0.)
ABSTRACT

A new application of the triple product processor in the self-synchronization of long
spread spectrum codes with high processing gain and a large range delay search is described.
The technique presented is useful for fine ranging, message decoding and C³I applications.

1.  INTRODUCTION 

The triple product processor (TPP) is a well-known acousto-optic (AO) signal processing
architecture [1-2]. In this system, the time-history output from an LED, laser diode or
point modulator is imaged through a horizontally-oriented AO cell, whose output is then
compressed and used to illuminate a second vertically-oriented AO cell. Both AO cells are
imaged onto a 2-D output detector array where the output is time-integrated. With input
signals $s_0(t)$, $s_1(t)$ and $s_2(t)$, this classic optical signal processor produces an
output described by

$$s(\tau_1,\tau_2) = \int s_0(t)\, s_1(t-\tau_1)\, s_2(t-\tau_2)\, dt, \tag{1}$$

and hence the name triple product processor is given to this system. This architecture is
quite attractive because it provides a 2-D output signal function even though it uses only
1-D AO devices. It is quite flexible, since if the input signals are properly chosen, the
system can produce either an ambiguity function output or a 2-D chirp-Z transform
folded-spectrum output.

In this paper, we demonstrate yet another use of this system: the self-synchronization
and demodulation of a spread spectrum signal with large processing gain and a large range
delay search. In Section 2, we highlight the results of a statistical analysis of this
system with attention to the output $\mathrm{SNR}_o$ and the processing gain provided for
narrow-band and wide-band noise jammers. As we show, the time-integration of a long code is
preferable to the integration of a short code when narrow-band noise is present. As the
specific coded waveform to be considered, we chose a product code. We describe such a code
and its attractive properties in Section 3. In our experimental verification, we used a
direct sequence code, even though the processing technique is appropriate for other types
of coded waveforms for ranging, communications and C³I applications. In Section 4, we
describe how the TPP can be used for processing such signals. In Section 5, we discuss
selection of the codes used and we highlight the results of our initial simulated
experiments.

We denote the spread spectrum modulated code by s(t). For the case of a pseudorandom
(PR) code, s(t) is a biphase PR sequence of $N_s$ bits, each of chip duration $T_p$ with
s(t) = ±1 for each chip. A long $N_s$-bit sequence can represent the coded transmittance of
a "1" or "0" bit of information. (The complement of s(t) can be used to represent a "0" bit
of data.) In a ranging application, the code is repeated and its synchronization and
demodulation data are used to provide fine range information (e.g. the GPS Navstar system
[3,4]). In the GPS system, information is also transmitted on the same channel. We will
concentrate on the demodulation and synchronization of the s(t) signal, since this is the
major operation required in the receiver and the one for which optical processing
techniques appear most attractive. We denote the period of the signal code by
$T_s = N_s T_p$, the signal time bandwidth product by STBW = $T_s B$ (where $B = 1/T_p$ is
the bandwidth of the code), and the system's or processor's integration time bandwidth
product by ITBW = $T_I B$ (where $T_I$ is the integration time over which the processor
integrates the signal). For the cases we consider, a large ITBW is necessary to achieve a
large processing gain performance, but the range delay in the received signal is not known
and can be very long. We must thus search each range delay and for each we must perform a
long time-integration to achieve the desired processing gain performance.

2.      PROCESSING  GAIN  AND  NOISE  JAMMERS 

In the GPS system [3,4], a long and a short PR code are simultaneously present on
quadrature carriers. The short code (C/A code) is a 1023 bit PR sequence at 1 MHz and the
longer code (P code) is transmitted at a 10 Mbps rate. The C/A code is used to allow fast
synchronization of the system. It is also used to synchronize the P code to provide high
resolution accuracy and a large processing gain (PG). In this section, we highlight the PG
performance of a long PR code and a shorter repeated PR code, both with the same
integration time and bandwidth in the presence of wide-band and narrow-band noise jammers.
We assume a stationary process with the levels of each bit in the direct sequence equally
probable and statistically independent. We denote the statistical auto-correlation of such
a code by $R_s(\tau)$, and note that it is a triangular function of width $T_p$ and peak
height $P_s$ (the power in the signal). We denote the input noise to the receiver by an
additive, stationary, zero-mean, Gaussian process with an auto-correlation function

$$R_n(\tau) = P_n \exp(-\tau^2/2r^2), \tag{2}$$

where $P_n$ is the total noise power and r is the correlation width. The noise power
$P_{n0}$ within the signal bandwidth ($\pm 0.5/T_p$) is described by

$$P_{n0} = P_n\,[\,2\,\mathrm{erf}(1/\gamma) - 1\,], \tag{3}$$

where $\gamma = B_n/B_s$ is the ratio of the bandwidth of the noise to the bandwidth of the
signal. We will use $\gamma$ in our comparison of the performance of the different
processors. A small value of $\gamma$ corresponds to a narrow-band noise jammer and a large
$\gamma$ value corresponds to a wide-band noise jammer.

We [5] derived expressions for the mean and variance of the correlation function when
input noise was present. We neglected the variance contribution to the correlation output
due to the sidelobe levels of the coded waveform itself and considered the contribution to
the variance of the correlation due only to the noise (this is realistic and appropriate
for the cases of large time bandwidth product signals and high jammer noise with which we
are concerned). The processing gain

$$\mathrm{PG} = \mathrm{SNR}_o/\mathrm{SNR}_i \tag{4}$$

for our signal and noise model was derived and computed. For the case of narrow-band noise
($\gamma = 0$), we find PG = STBW. For wide-band noise ($\gamma = \infty$), we find
PG = ITBW. In Figure 1, we show the normalized processing gain PG/ITBW in dB versus
ITBW/STBW. The PG cannot exceed ITBW and thus the vertical axis in Figure 1 represents the
loss that is obtained for various types of noise ($\gamma = 0$ to $\infty$) as a function
of the number of signal periods within the system's integration time (ITBW/STBW). The data
in Figure 1 are plotted for the case of a fixed ITBW = 40,000 and a fixed $B_s = 1/T_p$.


FIGURE 1  Normalized processing gain (PG) versus the number of cycles of the signal
integrated as a function of the bandwidth γ of the noise jammers present.
(Horizontal axis: ITBW/STBW.)
Our intention in Figure 1 is to quantify the different PG that results when a shorter
code is integrated several times compared to the integration of a longer code (with the
same total $T_I$ and $B_s$ used in both cases). As we see in Figure 1, with wide-band noise
($\gamma = \infty$), no PG difference occurs whether a long or short code is used. However,
for narrow-band noise ($\gamma = 0$ or $\gamma < \infty$), the PG performance obtained drops
steadily as the time bandwidth product of the code is reduced further below the time
bandwidth product of the system (with the same total integration time and $B_s$ retained in
all comparisons). In summary, we easily see from Figure 1 the quantitative performance
difference that occurs when $T_s < T_I$. In general, when the jammer noise is narrow-band,
we will not achieve the maximum performance possible when $T_s < T_I$. This provides
motivation for the use of an advanced correlator capable of providing a large ITBW and yet
simultaneously providing a large range delay search. The processing gain performance
demanded in a given application (and the type of noise to be expected) determines the ITBW
system required; however, in general, the ITBW of the system is matched to the STBW of the
code and as much PG as possible is desirable. We thus consider a new processor that can
achieve this.

3.      PRODUCT  CODES 


The specific type of spread spectrum code we consider is a product code. In one form,
such a code can be produced by the bit-by-bit product of two shorter repeated PR direct
sequence codes. Extension of the basic concept to other codes and the use of the TPP in
processing other types of product codes are realistic. At present, we consider only a
direct sequence PR biphase product code. In such a case, two codes u(t) and v(t) consisting
of $N_u$ and $N_v$ bits respectively are used. Each code is biphase coded with bit values
±1 and a bit duration $T_p = 1/B_s$, giving time sequence lengths $T_u = N_u T_p$ and
$T_v = N_v T_p$ for each code. We form the bit-by-bit product of two such repeated codes.
If $N_u$ and $N_v$ are relatively prime, the product sequence will have the same
statistical nature as u(t) and v(t) and a period $N_s T_p = N_u N_v T_p = T_s$. This is
easily seen by considering that u(t) and v(t) are repeated $N_v$ and $N_u$ times
respectively and that the code multiplication is over $N_u N_v$ bits. By definition $N_u$
and $N_v$ are both integer divisors of $N_s$ and thus $N_s$ contains the primes of $N_u$
and $N_v$. Likewise, $N_s$ must equal $N_u N_v$ divided by the largest common divisor of
$N_u$ and $N_v$. Since these have no common divisor, $N_s = N_u N_v$ as noted earlier. One
example of such a product code generation technique is the Gold code [6].
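
The period argument above can be checked on a toy example. The sketch below is illustrative only: the two short ±1 codes are hypothetical stand-ins (not the codes used in the paper), and the helper names `product_code` and `minimal_period` are assumptions.

```python
from math import gcd


def product_code(u, v):
    """Bit-by-bit product of the repeated codes u and v over one full period."""
    n_s = len(u) * len(v) // gcd(len(u), len(v))   # least common multiple
    return [u[t % len(u)] * v[t % len(v)] for t in range(n_s)]


def minimal_period(seq):
    """Smallest cyclic shift (dividing len(seq)) that maps seq onto itself."""
    n = len(seq)
    for p in range(1, n + 1):
        if n % p == 0 and all(seq[t] == seq[(t + p) % n] for t in range(n)):
            return p
    return n


u = [1, 1, -1]        # N_u = 3 (toy code, not from the paper)
v = [1, 1, 1, -1]     # N_v = 4 (toy code, not from the paper)
s = product_code(u, v)
print("N_s =", len(s), "minimal period =", minimal_period(s))
```

Because 3 and 4 are relatively prime, the product sequence here repeats only after $N_u N_v = 12$ bits.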

Such a product code is very attractive for many reasons. It can easily be generated
using only two shorter direct sequence PR codes. It also allows for the use of a novel
synchronization and demodulation technique (Section 4). However, these codes also have
another quite attractive property that simplifies the design of a range searching processor
using such codes. Specifically, if we slip one of the two reference codes (e.g. u(t)) used
to generate the reference product code by n bits, the resultant product code sequence
generated by the bit-by-bit product of these two reference codes is slipped by $nN_v$ bits
(for codes whose lengths differ by one bit, as in Section 4). We use this feature in our
spread spectrum product code TPP processor in Section 4.

4.      THE  TPP  FOR  PRODUCT  CODE  SYNCHRONIZATION  AND  DEMODULATION


Let us now consider use of the TPP system (Section 1) with a direct sequence PR product
code (Section 3) to provide self-synchronization, long integration time, a large PG and a
large range delay search (as required in Section 2). This technique follows our theory
first advanced in [5]. The system we propose is shown [5] in Figure 2. In this case, the
output in (1) is

$$R(\tau_1,\tau_2) = \int_{T_I} u(t)\, v(t-\tau_1)\, s(t+\tau_2)\, dt. \tag{5}$$

The received signal s(t) and the two shorter codes u(t) and v(t) that form the product code
are fed to the TPP as shown. The two output parameters $\tau_1 = x_1/v_s$ and
$\tau_2 = x_2/v_s$ must both be less than the aperture time $T_A$ of the AO cell, where
$(x_1,x_2)$ describes the spatial coordinates of the two cells and $v_s$ is the velocity of
sound in the AO cells. We omit the detailed modulation mechanisms and side-band filtering
performed in the TPP to obtain an output in the form of (1) or (5). These are detailed
elsewhere [7,8]. We also omit the continuous carrier terms on which the three signals are
recorded. This is valid since the output in (6) is obtained when the sum and difference of
the three carriers is zero. We also replace the continuous shift variables $\tau_1$ and
$\tau_2$ by the discrete parameters $i = \tau_1/T_p$ and $j = \tau_2/T_p$, where i and j are
integers and the 2-D TPP output at discrete detector locations (corresponding to the bits
of the product code) is

$$R(i,j) = \int_{T_I} u(t)\, v(t-iT_p)\, s(t+jT_p)\, dt. \tag{6}$$
FIGURE 2  Schematic diagram of a spread spectrum product PR code triple product
processor [5].

In our case, s(t) = u(t)v(t) is a product code. If $|N_u - N_v| = \Delta N$, where
$\Delta N = 1$, this spread spectrum product code has the property that

$$s(t - kN_uT_p) = u(t)\, v(t - k\,\Delta N\,T_p). \tag{7}$$

This is seen by realizing that u(t) is periodic in $N_uT_p$, v(t) is periodic in $N_vT_p$
and $\Delta N = |N_u - N_v|$ is the difference in the bit lengths of the two individual
codes. For the case we consider, $N_u = N_v - 1$ and $\Delta N = 1$, and from (7) we see that
if v(t) is shifted by one bit (k = 1), the resultant product code s(t) is shifted by $N_u$
bits. This is a key feature of the product code that enables us to use the TPP for spread
spectrum processing. In general, shifting v(t) in steps of $\Delta N\,T_p$ shifts the delay
in the product signal s(t) in steps of $N_uT_p$.
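
The shift property can be observed directly on a small example. The sketch below uses two hypothetical ±1 codes with $N_u = 3$ and $N_v = 4$ (so $N_u = N_v - 1$); the helper names are assumptions. It delays v by one bit and searches for the cyclic delay of the product code that gives the same sequence.

```python
def product_code_shifted(u, v, v_delay):
    """One period of u(t) * v(t - v_delay) for repeated +/-1 codes u and v.

    Assumes len(u) and len(v) are relatively prime, so the period is
    len(u) * len(v).
    """
    n_s = len(u) * len(v)
    return [u[t % len(u)] * v[(t - v_delay) % len(v)] for t in range(n_s)]


def cyclic_delays(reference, shifted):
    """All delays d (in bits) with shifted[t] == reference[t - d] for every t."""
    n = len(reference)
    return [d for d in range(n)
            if all(shifted[t] == reference[(t - d) % n] for t in range(n))]


u = [1, 1, -1]        # N_u = 3 (toy code, not from the paper)
v = [1, 1, 1, -1]     # N_v = 4 (toy code, not from the paper)
s = product_code_shifted(u, v, 0)          # reference product code
w = product_code_shifted(u, v, 1)          # v delayed by one bit
delays = cyclic_delays(s, w)
print("equivalent product-code delay(s):", delays)
```

In this indexing the single matching delay is $N_s - N_u$, that is, a shift of $N_u$ bits modulo $N_s$. The direction of the shift depends on the sign convention chosen for $\Delta N$ and for the delay, so the sketch confirms the step size $N_u$ rather than any particular sign.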

With $\Delta N = 1$, the P4 output plane in Figure 2 is approximately square and we can use
two AO cells with essentially the same time bandwidth products and aperture times. In this
case, we find

$$s(t - iN_uT_p) = u(t)\, v(t - iT_p). \tag{8}$$

Using (8) in (6), the system's output becomes

$$R(i,j) = \int_{T_I} s(t - iN_uT_p)\, s(t + jT_p)\, dt. \tag{9}$$

From (9), we see that the output for the case of a received product code signal delayed
with respect to the same reference product code is the desired auto-correlation of the
product code. Each (i,j) point in the output plane corresponds to a delay

$$\tau_D = iT_u + jT_p \tag{10}$$

between the received and reference codes. This output format corresponds to fine and coarse
time delay axes. The horizontal output coordinate i is coarse range with units of
$T_u = N_uT_p$ per line and the vertical coordinate j of the output is fine range (with a
resolution of one bit, $T_p$, of the code). The total 2-D output plane covers a full range
delay of $\tau_D = N_uN_vT_p = N_sT_p$ or the length $T_s$ of the full product code. When
the TPP system is operated as shown in Figure 2, it provides the full output correlation
covering the full $T_s$ of the code and from the location of the correlation peak the range
delay can be found. The correlation peak has the full PG, SNR and sidelobe levels of the
full product code. Such a system thus allows one to search the entire $T_s$ of the long
code and to achieve the full PG of the long code and a range delay accuracy $T_p$ of a
single bit all within a single processing time of only $T_I = T_s$ of the code. A more
general and detailed analysis of this operation of the TPP has recently been submitted for
publication by us [5].

5.  EXPERIMENTAL  VERIFICATION

As our product code, we chose $N_v = 31$ and $N_u = N_v - 1 = 30$. In general, we would
select the u(t) and v(t) codes to each have the lowest sidelobe levels. This provides the
lowest sidelobe levels for the resultant product code. We select v(t) to be a maximal
length or M-sequence (since this provides $1/N_v$ sidelobe levels). Since $N_u = N_v - 1$,
we select u(t) to be a balanced-sequence (with equal numbers of zeros and ones) of length
$p^m - 1$, where p is an odd prime number and m is an integer. When $(p^m - 1)/2$ is odd,
this u(t) code has maximum sidelobe levels of $2/N_u$.

For our specific example, $N_v = 31$, $N_u = 30$, p = 31, m = 1 and a 31-bit M-sequence and
a 30-bit balanced-sequence were used. In Figure 3, we show the output correlation plane for
these signals for two cases. In Figure 3a, the delay $\tau_D$ between the received and
reference product code was $\tau_{D1} = 668$ bits. This corresponds to a peak on the eighth
detector j = 8 on the 22nd line i = 22, or $(iN_u + j)T_p = (22 \times 30 + 8)T_p = 668T_p$.
In Figure 3b, the delay was $\tau_{D2} = 262$ bits and the peak occurs at i = 8 and j = 22
corresponding to $8 \times 30 + 22 = 262$ as expected. The sidelobe patterns present in
Figure 3 are correct. They can be determined by translating our origin to the peak location
(i,j) and recalling that the product code has sidelobe peaks at integer multiples $kT_u$
and $kT_v$ of the periods of each signal. To determine the location of these sidelobes, we
set $kT_u = iT_p + jT_u$; its solution is j = k and i = 0. This describes a line passing
through the peak and parallel to the j axis as shown. For the $kT_v$ sidelobes, we set
$kT_v = iT_p + jT_u$ and recall that $N_v - N_u = 1$. We find that the sidelobes now occur
on the 45° line (i = j) through the peak. The data in Figure 3 verify the expected results.
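
The coarse and fine indexing of the output plane can be reproduced in software. The sketch below is a rough emulation under stated assumptions, not the optical processor: v is a 31-chip sequence generated by the recurrence a[n] = a[n-3] XOR a[n-5], u is a simple 30-chip stand-in with equal numbers of +1 and -1 (not the balanced sequence used in the paper), and the function names are hypothetical. A delayed copy of the product code is correlated against every reference delay d, and the peak is reported as the cell (i, j) with $d = iN_u + j$.

```python
N_U, N_V = 30, 31
N_S = N_U * N_V


def lfsr_code_31():
    """31-chip +/-1 sequence from the recurrence a[n] = a[n-3] XOR a[n-5]."""
    state = [1, 0, 0, 0, 0]            # state[0] is the most recent bit
    chips = []
    for _ in range(N_V):
        chips.append(1 if state[4] else -1)
        feedback = state[2] ^ state[4]
        state = [feedback] + state[:4]
    return chips


def reference_code():
    """One period of the product code s(t) = u(t) v(t)."""
    u = [1] * 15 + [-1] * 15           # stand-in for the 30-chip code
    v = lfsr_code_31()
    return [u[t % N_U] * v[t % N_V] for t in range(N_S)]


def peak_cell(delay_bits):
    """Correlate a delayed product code against all delays; return (i, j)."""
    s = reference_code()
    received = [s[(t - delay_bits) % N_S] for t in range(N_S)]
    best_d, best_val = 0, None
    for d in range(N_S):
        val = sum(received[t] * s[(t - d) % N_S] for t in range(N_S))
        if best_val is None or val > best_val:
            best_d, best_val = d, val
    return divmod(best_d, N_U)         # d = i * N_u + j


cell_a = peak_cell(668)
cell_b = peak_cell(262)
print("delay 668 bits -> (i, j) =", cell_a)
print("delay 262 bits -> (i, j) =", cell_b)
```

The two delays map to the cells (22, 8) and (8, 22), the same assignments quoted for Figures 3a and 3b.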


FIGURE 3  Output plane data from the triple product processor of Figure 2
with a spread spectrum product PR code for two different range
delays [5].
Abstract


A CCD-based two-dimensional correlator system is described which correlates a
256 x 256 image with a 32 x 32 reference in less than 1 second. The high computation
rate (more than 100 million operations per second) is achieved using two high speed
CCDs: a 32-stage programmable transversal filter (PTF) which correlates an analog
signal with a set of 32 6-bit tap weights at a 5 MHz rate, and an accumulating memory
which sums successive correlation records from the PTF. This system uses a technique
which performs a series of one-dimensional correlations in the PTF and sums them in the
accumulator to form the two-dimensional correlation. This approach is capable of
considerable flexibility and can be extended to correlations of much larger image and
reference sizes even with a transversal filter of limited length, and also to
correlations of data of more than two dimensions.


Introduction 


Two-dimensional  correlation  is  of  much  interest  for  processing  imagery  and  other 
data  which  occur  in  a  two-dimensional  format.  Examples  of  its  use  include  matched 
filtering  for  pattern  recognition,  correlation  tracking  and  high  and  low  pass  filtering 
for  edge  detection  and  smoothing.  In  general  this  process  becomes  very  computation 
intensive  as  the  reference  and  signal  array  sizes  grow,  and  fast,  special-purpose 
digital  processors  are  often  needed  to  perform  the  computations  in  a  reasonable  time. 
Because  of  their  highly  efficient  computational  power,  CCDs  should  be  attractive  for 
this  type  of  processing,  particularly  in  systems  where  compactness  and  low  power 
consumption  are  essential. 


However,  the  use  of  CCDs  for  performing  two-dimensional  correlations  has  a  rather 
limited  history.  Hall,  et  al  have  described  a  CCD  imager  with  a  9-tap  delay  line  output 
which  allows  correlation  of  the  image  with  a  3  x  3  kernel '  .  The  tap  weight  values  are 
set  and  summing  programmed  by  off-chip  circuitry.  Fouse,  Nudd  and  Nygaard  have 
described  various  CCD-based  circuits  which  apply  fixed  weight  kernels  of  various  sizes, 
and  a  CCD  device  capable  of  convolving  an  image  with  a  5x5  voltage  programmable 
kernel^.  The  dynamic  range  of  the  programmable  device  was  reported  to  be  14  gray 
levels. 
System  concept


A  diagram  of  the  2-D  correlator  system  is  shown  in  Fig.  1.  The  signal  and
reference  data  are  stored  as  8-  and  6-bit  words  respectively,  and  coded  in  2's 
complement  format.  The  correlation  output  is  likewise  digitized  to  8  bits  and  stored  in 
a  256  x  256  memory  where  it  can  be  accessed  either  for  display  or  by  the  computer.  The 
actual  correlation  computation  is  performed  by  two  CCDs:  a  programmable  transversal 
filter  (PTF),  and  an  accumulating  memory.  The  PTF  is  a  32-stage  device  whose  tap  weight 
values  are  digitally  programmable  as  6-bit  words  (5-bits  plus  sign  bit)-*  and  is 
described  further  in  the  next  section.  The  accumulating  memory  consists  of  a  shift 
register  with  a  charge  storage  site  beside  each  shift  register  cell.  A  sequence  of 
charge packets can be loaded in the shift register and parallel-transferred into the
storage sites where they are added to previously stored packets. Later the contents of
the storage sites can be retrieved into the shift register and clocked out.
In this system we perform a series of one-dimensional correlations using the PTF
and sum them in the accumulating memory to produce the 2-D correlation. If s(i,j) and
r(i,j) are the signal and reference values corresponding to row i and column j of each
array, then the cross-correlation c(m,n) is given by

$$c(m,n) = \sum_{i=0}^{31} \sum_{j=0}^{31} r(i,j)\, s(i+m,\, j+n) = \sum_{i=0}^{31} p(i,\, i+m,\, n), \qquad m,n = 0, 1, \ldots, 255 \tag{1}$$

where

$$p(i,\, i+m,\, n) = \sum_{j=0}^{31} r(i,j)\, s(i+m,\, j+n). \tag{2}$$

Equation 2 defines a partial or one-dimensional correlation p(i,i+m,n), n = 0,1,...,255,
which is the correlation of row i of the reference with row i+m of the signal. Equation
1 shows that 32 of these 1-D correlations are then vector summed to produce c(m,n), or
row m of the cross-correlation matrix.


In  our  system  the  PTF  computes  the  p-vectors  while  the  accumulating  memory  sums 
these  vectors.  This  procedure  can  also  be  described  as  follows:  Row  0  of  the  reference 
is  loaded  into  the  PTF  tap  weight  storage  and  correlated  with  row  0  of  the  signal,  and 
the  result  is  stored  in  the  accumulator.  Then  row  1  of  the  signal  and  row  1  of  the 
reference  are  correlated  and  the  result  added  to  the  first  correlation.  This  process 
continues  until  rows  31  of  the  signal  and  reference  are  correlated.  The  accumulator 
then contains the first row of the cross-correlation which is clocked out. To produce
the  second  row  of  the  cross-correlation,  reference  rows  0  through  31  are  correlated  with 
signal  rows  1  through  32,  and  for  each  succeeding  cycle  the  starting  address  of  the 
signal   memory   row   is   advanced  by  one. 
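The row-by-row procedure described above can be sketched in a few lines of Python. This is only an illustration of Equations 1 and 2, not the hardware's actual control sequence: the function names are hypothetical, the array sizes are left general, and indices that run past the edge of the signal array are assumed to wrap around (cyclic indexing), which is an assumption made here for the sketch.

```python
def correlate_rows(ref_row, sig_row):
    """1-D correlation of one reference row with one signal row (Eq. 2).

    Plays the role of the PTF. Column indices wrap around (assumed cyclic).
    """
    width = len(sig_row)
    return [sum(r * sig_row[(j + n) % width] for j, r in enumerate(ref_row))
            for n in range(width)]


def correlate_2d(ref, sig):
    """2-D correlation built by accumulating 1-D row correlations (Eq. 1)."""
    height = len(sig)
    result = []
    for m in range(height):                      # one output row per cycle
        accumulator = [0] * len(sig[0])          # cleared accumulating memory
        for i, ref_row in enumerate(ref):        # reference rows 0, 1, ...
            partial = correlate_rows(ref_row, sig[(i + m) % height])
            accumulator = [a + p for a, p in zip(accumulator, partial)]
        result.append(accumulator)               # row m is clocked out
    return result


def correlate_2d_direct(ref, sig):
    """Direct double sum of Eq. 1, used only to check the row-wise version."""
    height, width = len(sig), len(sig[0])
    return [[sum(ref[i][j] * sig[(i + m) % height][(j + n) % width]
                 for i in range(len(ref)) for j in range(len(ref[0])))
             for n in range(width)]
            for m in range(height)]
```

For the system described here, `ref` would be a 32 x 32 array and `sig` a 256 x 256 array; small arrays are enough to confirm that the accumulated 1-D correlations agree with the direct double sum.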
in  both  dimensions
two-dimensional  di
size   256   x  256.
yclic  convolution


ices  in  Equations 
lations  are  cyclic 
This  is  the  same 
is  computed  using 
s   are      implicit  in 


The  correlation  system  in  Fig.  1  is  capable  of  expansion  to  larger  signal  and 
reference  array  sizes.  The  width  (number  of  columns)  of  the  signal  memory  is  limited  by 
the  number  of  storage  sites  in  the  accumulator,  while  the  height  of  the  signal  array 
could  be  expanded  indefinitely.  The  width  of  the  reference  array  is  set  equal  to  the 
number  of  taps  on  the  PTF,  while  the  reference  height  is  limited  by  the  charge  storage 
capacity  of  the  accumulator.  Actually,  the  reference  width  can  be  made  wider  (within 
limits  imposed  by  the  accumulator  charge  capacity)  by  breaking  each  row  of  the  reference 
into  segments  of  32  or  less  and  correlating  each  segment  with  a  given  signal  row.  In 
this  case  the  addressing  circuitry  of  the  signal  memory  must  start  at  different  column 
indices  for  each  reference  segment  to  ensure  that  the  partial  correlations  are  properly 
aligned   when   summed    in   the  accumulator. 


In addition to these possibilities, this concept can be extended to any number of
dimensions for signal and reference. Thus, using the transversal filter/accumulator
combination as the key computational component, any multi-dimensional correlator can be
configured simply by modifying the addressing and timing logic and the size of the
memories.


Description   of   correlator  system 


As  previously  stated,  the  CCD  PTF  and  accumulator  perform  the  major  computational 
load.  The  structure  of  the  PTF  is  diagrammed  in  Fig.  2.  The  device  architecture  is  of 
the so-called "pipe-organ" type in which the time delayed
samples of the analog input are obtained from a bank of CCD delay lines of varying
length. The principal attraction of this structure is that the signal is detected as
charge packets using a conventional gated charge-integrator output circuit. This type
of  output  circuit,  by  virtue  of  its  simplicity,  has  a  higher  bandwidth  than  the  tapped- 
delay-line  circuits  which  must  sense  current  and  maintain  the  tapping  electrodes  at  a 
virtual  ground  potential,  and  is  also  easier  to  shield  from  clock  feedthrough  than 
tapping   structures   which  must    be    interleaved   with   clocked  electrodes. 

The  tap  weight  values  are  determined  digitally  through  a  multiplying  D/A  converter 
(MDAC) at the input to each delay line. The MDAC, shown in Fig. 3, is a CCD structure
with multiple inputs, each input corresponding to one bit of the tap weight and having
an input gate (G2) whose channel area at the input for each bit is a factor of 2^n (for
bit n) larger than the smallest such gate. Each input also has an additional gate G1
which  turns  the  charge  flow  on  or  off  thereby  effecting  a  multiplication  of  the  charge 
packet  by  0  or  1.  The  voltage  level  on  this  gate  is  controlled  by  the  logic  which 
stores  the  tap  weight  values.  The  tap  weight  logic  has  a  buffer  storage  capability  so 
that  a  new  set  of  weights  can  be  read  into  the  logic  while  the  filter  is  operating  with 
a  previous  set  of  weights.  The  tap  weight  storage  can  then  be  quickly  updated  with  the 
next  weights  in  one  clock  period.  In  order  to  use  bipolar  tap  weights  we  use  a  separate 
filter  on  the  same  chip  which  correlates  the  signal  with  the  sign  bits  (MSBs)  and  the 
outputs  of  the  two  filters  are  subtracted  in  an  off-chip  differential  amplifier.  This 
device  has  been  operated  at  clock  frequencies  up  to  25  MHz  with  a  dynamic  range  of 
50  dB,    although   for    this   correlator   system  we   use   a   clock   rate   of    5  MHz. 
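The binary-weighted MDAC and the separate sign-bit filter can be modelled arithmetically as follows. This is a sketch under stated assumptions, not a description of the device physics: the function names are hypothetical, charge is treated as an ideal number, and the sign-bit path is assumed to carry a weight of 2^5, which is what 2's complement coding of a 6-bit word implies but which the text does not state explicitly.

```python
def mdac_charge(sample, weight_bits):
    """Ideal MDAC: bit n gates a charge packet proportional to 2**n * sample."""
    return sum((2 ** n) * sample for n, bit in enumerate(weight_bits) if bit)


def bipolar_tap(sample, weight):
    """Multiply a sample by a signed 6-bit tap weight (-32..31).

    The five low-order bits drive the main filter; the sign bit (MSB)
    drives a second filter whose output is subtracted, as a differential
    amplifier would do. The 2**5 scaling of the sign path is an assumption.
    """
    if not -32 <= weight <= 31:
        raise ValueError("tap weight must fit in 6 bits (2's complement)")
    code = weight & 0x3F                              # 6-bit 2's complement code
    low_bits = [(code >> n) & 1 for n in range(5)]    # bits 0..4
    sign_bit = (code >> 5) & 1                        # MSB
    main_output = mdac_charge(sample, low_bits)
    sign_output = sign_bit * (2 ** 5) * sample
    return main_output - sign_output                  # differential subtraction
```

With these assumptions `bipolar_tap(sample, weight)` equals `sample * weight` for every representable weight, which is the property the two-filter arrangement is meant to provide.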

For the accumulator we were able to use a part of a 100 x 400-pixel CCD imager
which has been described elsewhere. This device is depicted in Fig. 4 and has a
405-stage output register with an electrical input. The design of this device permitted us
to  transfer  charge  from  the  output  register  into  the  bottom  row  of  400  imaging  cells 
which  served  as  storage  sites.  Although  this  device  was  convenient  for  our  purposes,  it 
fell  short  of  being  optimum  for  this  application  in  two  respects.  First,  the  limited 
storage  capacity  of  the  imaging  wells  forced  us  to  use  relatively  small  input  signal 
amplitudes,  and  this  in  turn  meant  that  the  input  structure  operated  in  a  somewhat 
nonlinear  portion  of  its  voltage-to-charge  transfer  characteristic.  Secondly,  a 
structure  having  independent  input  and  output  registers  would  be  desirable  because  the 
two  registers  could  be  clocked  independently,  and  this  in  turn  would  mean  that  the  input 
and   output   data    flow  could   be   nearly  continuous. 

The  two  CCDs,  all  of  their  support  circuitry,  and  the  D/A  and  A/D  converters  shown 
in  Fig.  1  have  been  incorporated  in  three  circuit  boards  which  are  shown  in  Fig.  5.  The 
D/A  converter  at  the  input  to  the  PTF  and  A/D  converter  at  the  accumulator  output  are 
located  on  these  boards.  The  input  data  rate  to  the  PTF  is  5  MHz,  while  the 
accumulating  memory   is   clocked   out    at    1  MHz. 
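A rough idea of what these clock rates imply can be obtained from a back-of-envelope estimate. The sketch below is an assumption-laden illustration, not a measured timing budget: it ignores all control and tap-weight loading overhead, and it assumes that the full 405-stage output register is clocked out once per correlation row. The constant and function names are hypothetical.

```python
PTF_RATE_HZ = 5e6               # input data rate to the PTF
READOUT_RATE_HZ = 1e6           # accumulating memory read-out rate
SIGNAL_ROWS = 256               # rows of the signal (and of the output)
SIGNAL_COLS = 256               # samples per signal row
REFERENCE_ROWS = 32             # 1-D correlations summed per output row
OUTPUT_REGISTER_STAGES = 405    # stages clocked per read-out (assumed)


def correlation_time():
    """Time to stream every signal row needed through the PTF, no overhead."""
    samples = SIGNAL_ROWS * REFERENCE_ROWS * SIGNAL_COLS
    return samples / PTF_RATE_HZ


def readout_time():
    """Time to clock all output rows out of the accumulator, no overhead."""
    return SIGNAL_ROWS * OUTPUT_REGISTER_STAGES / READOUT_RATE_HZ


if __name__ == "__main__":
    print(f"PTF streaming time: {correlation_time():.3f} s")
    print(f"accumulator read-out time: {readout_time():.3f} s")
```

Under these assumptions the PTF streaming time is a little over 0.4 s and the read-out time is about 0.1 s for a full 256 x 256 correlation.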

A  photograph  showing  the  inside  of  the  system  is  shown  in  Fig.  6,  and  reveals  a 
cage  of  32  cards  (mostly  wire-wrap)  and  a  control  panel.  In  addition  there  are  3  CRT 
displays  not  visible  in  the  photo.  The  front  panel  controls  include  many  functions 
designed  to  enhance  the  system  capabilities.  The  256  x  256  signal  input  is  actually 
obtained  by  summing  a  32  x  32  array  which  simulates  a  target  and  a  256  x  256  array  which 
simulates  background  noise  or  clutter.  Two  of  the  front  panel  controls  permit  the 
32  x  32  signal  to  be  moved  anywhere  in  the  256  x  256  noise  field.  Another  set  of 
switches  allows  the  operator  to  attenuate,  by  powers  of  2,  either  the  32  x  32  signal  or 
the  256  x  256  background,  and  this  provides  a  convenient  way  of  varying  signal  to  noise 
ratios.  The  reference  memory  has  the  capacity  to  hold  32  different  32  x  32  patterns, 
and  in  one  mode  of  operation  the  system  performs  a  compute-and-display  operation  using 
each  successive  reference  pattern  which  is  useful  when  looking  for  correlation  matches 
with  various  rotated  or  magnified  versions  of  the  reference.  A  thresholding  capability 
is  available  when  displaying  the  correlation  output,  and  only  correlations  above  a  set 
level   are   displayed   with    this  feature. 

A  large  fraction  of  the  32  cards  are  allocated  to  the  memory  (14  cards),  while  10 
cards  supply  the  remaining  logic  functions  such  as  clocking,  memory  control  and  computer 
interface  and  5  cards  supply  the  display  drive  signals.  The  system  size  could  be  easily 
reduced with a modest re-design effort, with the largest reduction occurring if the 8K
static   RAMs   used    in   the   memory  were    replaced   with    64K  dynamic  RAMs. 
System performance


We  have  performed  preliminary  tests  to  demonstrate  the  system  capabilities,  and 
some  results  are  shown  in  Figs.  7  and  8.  In  Fig.  7a,  a  test  signal  is  shown  in  a  quasi- 
3D  display  and  consists  of  a  32-  x  32-pixel  pyramid  shape  against  a  zero  background. 
The  pyramid  height  is  64,  or  1/2  the  maximum  amplitude  of  127  (signal  amplitude  range  is 
-128  to  127).  The  32  x  32  reference  for  these  tests  was  the  same  pyramid  but  with  6-bit 
resolution.  This  shape  was  chosen  because  it  utilizes  a  variety  of  tap  weight 
configurations  on  the  PTF  and  because  it  was  easy  to  generate  in  software.  Figure  7b 
shows  the  resulting  correlation  which  has  been  digitized  and  stored  in  the  correlation 
memory,  while  Fig.  7c  shows  the  calculated  result.  The  striations  in  the  CCD 
correlation are a result of clock feedthrough which can be removed by post-processing.
The dynamic range (peak output signal/rms noise) is 36 dB for the result in Fig. 7b, but
with post-processing to remove the clock feedthrough the dynamic range is increased to
about 39 dB. In Fig. 8a we have added to the pyramid uncorrelated Gaussian noise of rms
value 64 so that signal/noise = 1. The 2-D correlation result shown in Fig. 8b clearly
reveals  the  presence  of  the  signal.  In  Figs.  8a  and  8b  the  y-axis  amplitude  has  been 
reduced   from   that    of    Fig.    7    by   a    factor   of    4    for   visual  clarity. 

The results shown here required 1 second, but the high speed correlations by the
PTF and vector summing by the accumulator required only about 0.5 seconds. Approximately
0.4 seconds of the remaining time could be eliminated if the PTF logic were designed for
higher speed operation. The remaining 0.1 seconds is the time required to read the
correlation result out of the accumulator, and could be eliminated with an accumulator
having independent read-in and read-out register capabilities. Further increases in
speed can occur by using multiple correlator/accumulator pairs and dividing the
computation  work  among  them.  For  example,  with  two  correlator/accumulator  pairs  one 
pair  could  be  used  to  correlate  the  even  lines  and  the  other  the  odd  lines  of  the  signal 
with  reference.  The  two  accumulator  outputs  would  then  be  summed  to  produce  the  desired 
correlation. 
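The two-pair arrangement works because the sum over rows in Equation 1 can be split into any two groups and recombined. The sketch below is one reading of that idea, in which the row correlations for each output row are divided by the parity of the row index; the function names are hypothetical, and cyclic indexing at the array edges is assumed for simplicity.

```python
def row_correlation(ref_row, sig_row):
    """1-D correlation of a reference row with a signal row (cyclic columns)."""
    width = len(sig_row)
    return [sum(r * sig_row[(j + n) % width] for j, r in enumerate(ref_row))
            for n in range(width)]


def accumulate(ref, sig, m, row_indices):
    """Sum the 1-D correlations for output row m over the given reference rows."""
    total = [0] * len(sig[0])
    for i in row_indices:
        partial = row_correlation(ref[i], sig[(i + m) % len(sig)])
        total = [a + p for a, p in zip(total, partial)]
    return total


def output_row_two_pairs(ref, sig, m):
    """Output row m formed by two correlator/accumulator pairs working in parallel."""
    even = accumulate(ref, sig, m, range(0, len(ref), 2))   # first pair
    odd = accumulate(ref, sig, m, range(1, len(ref), 2))    # second pair
    return [e + o for e, o in zip(even, odd)]               # final summation
```

Each pair performs half of the 1-D correlations, so in this idealized picture the streaming time per output row is halved while the summed result is unchanged.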

Summary 

A  CCD-based,  two-dimensional  correlator  system  has  been  described  which  correlates 
a  32  x  32  x  6-bit  reference  with  a  256  x  256  x  8-bit  signal  in  1  second.  The  system 
relies  on  the  computational  power  of  two  CCDs,  a  programmable  transversal  filter  and  an 
accumulating  memory  to  perform  the  computations  at  high  speed.  The  dynamic  range  is 
36  dB  and  is  presently  limited  by  clock  feedthrough  at  the  output  of  the  PTF. 
Considerable  improvements  in  computation  speed  are  possible  by  incorporating  some  design 
improvements  into  the  PTF  and  accumulator  and  by  using  multiple  PTF/accumulator  pairs  to 
operate    in   parallel   on   the  data. 
Figure 2. Diagram of the CCD programmable transversal filter (PTF) used in the 2-D
correlator.

Figure 3. Diagram of the multiplying D/A converter (MDAC) used at the input to each
delay line of the PTF.

Figure 4. Diagram of a 100 x 400 element CCD image sensor. The output register and bottom
row of imaging array cells were used as the accumulating memory for the correlator.

Figure 5. Photograph of the CCD circuit boards. All the CCD support circuitry as well as
the D/A and A/D converters are on these boards.

Figure 7c. Exact calculation of the correlation depicted in Figure 7b.

Figure 8a. The same signal as displayed in Figure 7a but with added uncorrelated
Gaussian noise of rms value equal to the pyramid height.

Figure 8b. The correlation of the signal in Figure 8a with the 32 x 32 pyramid
reference.
Abstract

This  paper  describes  a  method  by  which  optical  time-  and  space-integrating  processors 
may  be  generalized  and  expanded  by  accommodating  vector  inputs.     Many  architectures  using 
this  frequency  division  multiplexing  approach  are  presented,   demonstrating  new  operations 
and  generalizations  of  old  ones. 

Introduction 

This paper describes a method by which optical time- and space-integrating correlators
may be generalized and expanded. The conventional correlators, which are well documented (Refs. 1, 2),
may be thought of as processing scalar functions of time, say, f(t) and g(t). The time
integrators and some space integrators provide an ensemble of outputs

$$\left\{ \int_{t-T}^{t} f(\lambda)\, g(\lambda-\tau)\, d\lambda \;:\; \tau \in I \right\}$$

or

$$\left\{ \int_{t-T}^{t} p(\lambda)\, f(\lambda-\tau)\, g(\lambda-\sigma)\, d\lambda \;:\; \tau, \sigma \in I \right\}$$

for some interval I and discrete observation time t. Many space-integrating processors
produce the continuous-time output

$$\int_{-T/2}^{T/2} f(t-\lambda)\, g(t+\lambda)\, d\lambda.$$

The extension to be described here involves the acceptance of vector inputs

$$\underline{f}(t) = (f_1(t),\, f_2(t),\, \ldots,\, f_N(t)) \tag{1}$$

and

$$\underline{g}(t) = (g_1(t),\, g_2(t),\, \ldots,\, g_N(t)). \tag{2}$$

These vectors are encoded as scalar functions f(t) and g(t) by frequency division
multiplexing (FDM), putting the vectors in a form acceptable to conventional input
modulators. In particular, the encoding is accomplished as shown in Figure 1. Adjacent
elements of $\underline{f}$ are placed on carriers differing by $b_0$ Hertz. The translated elements are
then summed to produce f. It is assumed that each signal $f_i(t)$ is confined to the
frequencies (-B, B). $b_0$ is large enough to avoid aliasing by accommodating the component
bandwidth and a guard band of width $B_g$, that is,

$$b_0 = 2B + B_g. \tag{3}$$

The value of $B_g$ will be determined by the architecture. Each input vector is encoded in
the identical manner, using the same value of $b_0$.

In  addition  to  providing  an  input  acceptable  to  modulators,    the  use  of  the  FDM  encoding 
scheme  allows  controlled  interaction  between  vector  elements  and  simple  separation  of 
desired  outputs  by  transform  plane  filtering. 
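The FDM encoding and the separation of one channel can be illustrated numerically. The sketch below is a minimal model under stated assumptions: the multiplex is written in complex (analytic-signal) form with channel n on carrier n·b_0, the function names and the test signals are hypothetical, and "separation" is done by shifting a channel back to baseband and averaging over a whole number of carrier periods, which stands in for transform-plane filtering around zero frequency.

```python
import cmath
import math


def fdm_encode(components, b0, times):
    """Complex FDM multiplex: sum over n of f_n(t) * exp(j*2*pi*b0*n*t), n = 1..N."""
    return [sum(f(t) * cmath.exp(2j * math.pi * b0 * n * t)
                for n, f in enumerate(components, start=1))
            for t in times]


def mean_after_mixdown(multiplex, b0, n, times):
    """Shift channel n to baseband and average over the sampled window.

    Averaging keeps only the zero-frequency part, so the other channels
    (now sitting on multiples of b0) drop out when the window holds a
    whole number of periods of every frequency present.
    """
    mixed = [s * cmath.exp(-2j * math.pi * b0 * n * t)
             for s, t in zip(multiplex, times)]
    return sum(mixed) / len(mixed)


if __name__ == "__main__":
    B, B_g = 1.0, 2.0                 # component bandwidth and guard band (Hz)
    b0 = 2 * B + B_g                  # carrier spacing from Eq. 3
    times = [m * 2.0 / 256 for m in range(256)]      # 2 s window
    components = [lambda t: 0.5 + math.cos(math.pi * t),
                  lambda t: -1.0,
                  lambda t: 2.0]
    multiplex = fdm_encode(components, b0, times)
    for n in (1, 2, 3):
        print(n, mean_after_mixdown(multiplex, b0, n, times))
```

The same mixing-and-averaging idea reappears in the architectures that follow, where a detector responds only to terms whose carrier frequency is near zero.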

This  paper  will  briefly  describe  a  variety  of  Bragg  cell  architectures  which  make  use  of 
the  FDM  approach.     It  should  be  noted  that  although  scalar  inputs  have  been  replaced  by 
vector  inputs,   the  information  that  can  be  accommodated  by  the  inputs  is  not  increased,  and 
is,   in  fact,   somewhat  decreased.     This  is  because  the  input  modulators  must  now  admit  a 
bandwidth $N b_0 - B_g$. Hence, if the modulator bandwidth is fixed, B must become small as N
is made large. The effect is a redistribution of the modulator time-bandwidth product.
The requirement of a guard band causes a loss in input capacity. As will be seen, this loss
may well be justified.

In  the  architecture  descriptions  which  follow,   it  will  be  assumed  that  the  signals  f  and 
g  have  been  properly  translated  to  the  frequency  band  appropriate  for  the  Bragg  cells. 
Also,   the  trivial  matter  of  removing  light  undiffracted  by  modulators  will  be  ignored. 


Figure 1. FDM encoding of f(t).

Architectures 

Architecture  1  -  stacked  correlator   (time  integrating) 

The first, and perhaps most obvious, use of the FDM approach is a "stacked"
time-integrating correlator. This is a means of calculating the cross-correlation functions
between $f_i$ and $g_i$ for all $i \in \{1, \ldots, N\}$ simultaneously. Figure 2 diagrams such a
processor.

Bragg cell A01 is driven by f(t), causing it to modulate the transmitted light by either
the analytic* signal

$$[f(t)]_+ = \tfrac{1}{2}\,[f(t) + j f_H(t)] \tag{4}$$

or the "antianalytic" signal

$$[f(t)]_- = \tfrac{1}{2}\,[f(t) - j f_H(t)] \tag{5}$$

associated with f(t), depending upon whether the positive or negative diffraction order is
chosen, where $f_H(t)$ is the Hilbert transform of f(t). In this processor we will choose to
use $[f(t)]_+$ and $[g(t)]_+$. The processor will take the products

$$[f(t+\tau)]_+\,[g(t-\tau)]_+ = \Big[\sum_{n=1}^{N} f_n(t+\tau)\, e^{j2\pi b_0 n (t+\tau)}\Big] \cdot \Big[\sum_{m=1}^{N} g_m(t-\tau)\, e^{j2\pi b_0 m (t-\tau)}\Big]$$

$$= \sum_{n=1}^{N} \sum_{m=1}^{N} f_n(t+\tau)\, g_m(t-\tau)\, e^{j2\pi b_0 (n+m) t}\, e^{j2\pi b_0 (n-m) \tau} \tag{6}$$

for all $\tau \in I$. The value of $\tau$ will be determined by spatial position. The only interest here
is in terms where n=m in (6). Notice that these terms occur with spatial frequency centered
about zero. Clearly, if $b_0$ is chosen to be large enough, the terms with n-m = i have
spatial frequencies disjoint with those having n-m $\ne$ i. It is sufficient for

$$b_0 \ge 4B, \tag{7}$$

since $f_n(t+\tau)\, g_m(t-\tau)$ is constrained to the spatial frequency interval (-2B, 2B). Thus, if

$$B_g = 2B, \tag{8}$$

the desired n=m terms may be separated from the undesired terms by means of frequency
filtering.

*The analytic signal associated with f(t) is formed by removing the negative frequency
terms of f(t). Similarly, define the "antianalytic" signal to be f(t) with positive
frequency components suppressed.
In addition to forming the products $[f(t+\tau)]_+\,[g(t-\tau)]_+$ and filtering to select only the
wanted cross-products, the processor must separate the selected cross-products from each
other, so that they may be integrated to form the desired correlations.

The signal f(t) drives Bragg cell A01. Light converging vertically and passing through
A01 is diffracted with an amplitude $[f(t+x/v)]_+$, where x is the horizontal position
coordinate in the plane of the cell, and v is the acoustic propagation velocity. A01 is
located in the front focal plane of spherical lens L1. The diffracted light passes through
L1 and reaches spatial filter F2 located in the back focal plane of L1. It is here that
components of f are forced to different vertical positions. This is accomplished as
follows: lens L1 forms the Fourier transform of f in the plane of F2. The method of
illumination causes the transform to be spread in the vertical direction. The frequencies of
each $f_i$ are seen separated horizontally from those of the other components $f_j$, $j \ne i$, because
of the FDM scheme. F2 is opaque except at rectangles arranged along the diagonal. The ith
rectangle is horizontally positioned to admit all of the light from $f_i(t)$ and none from the
other components. Hence, light diffracted by $f_i(t)$ emerges from F2 in a unique vertical
interval disjoint from intervals of the other components. This vertical division will be
exploited later to separate individual products.

F2 and cell A02 are located in the front and back focal planes of L2, respectively.
Thus, lenses L1 and L2 image A01 into the plane of A02. A02, driven by g, applies the
modulation $[g(t-x/v)]_+$ to the diffracted light, where x is again the horizontal position.
The x-variation of this doubly diffracted light is imaged via spherical lens L3 and
cylinder lens L4 onto a two-dimensional detector array D through spatial filter F1. F1 is
in the transform plane of the L3-L4 combination. Here, the transform of
$[f(t+x/v)]_+\,[g(t-x/v)]_+$ with respect to x is seen. As described earlier, F1 blocks all undesired
cross-product terms by admitting only the frequency band [-2B, 2B].

Lenses L2 and L3 image the vertical variation in the plane of F2 onto D. Thus, the ith
row in D receives light bearing the modulation from $f_i$. Now the amplitude of light
striking the ith row of D at horizontal position x is


$$A_i(t,x) = f_i(t+x/v)\, g_i(t-x/v)\, e^{j4\pi b_0 i t}. \tag{9}$$

The detector pixel here integrates $|A_i|^2$ over the period T to produce

$$C_i(x) = \int_T |f_i(t+x/v)\, g_i(t-x/v)|^2\, dt. \tag{10}$$

If incoherent correlation is wanted, then the $\{C_i\}$ of (10) are the desired results. If
coherent correlation is required, a coherent detection scheme may be employed (Ref. 3). As an
example, the inputs

$$f_i(t) = F_i(t)\, e^{j2\pi bt} + e^{-j4\pi bt},$$

$$g_i(t) = G_i(t)\, e^{j2\pi bt} + e^{-j4\pi bt}, \qquad i = 1, \ldots, N \tag{11}$$

may be used, where $\{F_i\}$ and $\{G_i\}$ are constrained to the frequency band (-b, b) = (-B/4,
B/4). The outputs are

$$C_i(x) = 2\,\mathrm{Re}\Big\{ e^{j12\pi bx/v} \int_T F_i(t+x/v)\, G_i^*(t-x/v)\, dt \Big\} + \int_T \Big[\, |F_i(t+x/v)\, G_i(t-x/v)|^2 + 1 + |F_i(t+x/v)|^2 + |G_i(t-x/v)|^2 \Big]\, dt. \tag{12}$$

By multiplying $C_i(x)$ by $\cos(12\pi bx/v)$ or $\sin(12\pi bx/v)$ and low-pass filtering, the real or
imaginary portions of

$$\int_T F_i(t+x/v)\, G_i^*(t-x/v)\, dt$$

are obtained, respectively.
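The structure of Equation 12 can be checked numerically in a simplified setting. The sketch below assumes constant complex envelopes F and G (so that the integrals reduce to averages), averages over exactly one period 1/b so that all terms riding on a time carrier cancel, and works per unit time rather than over the full period T. The function names are hypothetical and the example is only a consistency check, not a model of the optical system.

```python
import cmath
import math


def detector_output(F, G, b, x_over_v, samples=64):
    """Time average of |f(t + x/v) * g(t - x/v)|**2 over one period 1/b.

    f and g follow Eq. 11 with constant envelopes F and G (an assumption).
    """
    period = 1.0 / b
    total = 0.0
    for m in range(samples):
        t = m * period / samples
        u, w = t + x_over_v, t - x_over_v
        f = F * cmath.exp(2j * math.pi * b * u) + cmath.exp(-4j * math.pi * b * u)
        g = G * cmath.exp(2j * math.pi * b * w) + cmath.exp(-4j * math.pi * b * w)
        total += abs(f * g) ** 2
    return total / samples


def predicted_output(F, G, b, x_over_v):
    """Right-hand side of Eq. 12 for constant envelopes, per unit time."""
    fringe = 2 * (cmath.exp(12j * math.pi * b * x_over_v) * F * G.conjugate()).real
    bias = abs(F * G) ** 2 + 1 + abs(F) ** 2 + abs(G) ** 2
    return fringe + bias
```

The fringe term varies as cos(12πbx/v) across the detector, which is why multiplying by a cosine or sine at that spatial frequency and low-pass filtering recovers the real or imaginary part of the correlation.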


The  architecture,   then,   provides  a  means  of  generating  an  ensemble  of  correlation 
integrals  simultaneously.     With  the  above  coherent  detection  scheme,   each  correlation 
integral has a resolution of 1/(2b) and range $2T_B$, where $T_B$ is the time aperture of each
Bragg cell. Thus, $4NbT_B = NBT_B$ outputs are effectively produced. The Bragg cell must
possess a time-bandwidth product of approximately $4NBT_B$. A conventional correlator
employing  the  same  detection  technique  provides  twice  as  many  outputs.     The  difference  is 
due  to  the  requirement  of  a  guard  band  between  vector  components.     Note  that  this  method  is 
a  means  of  partitioning  the  Bragg  cell  time-bandwidth  product  into  two  dimensions:  an 
increase  in  N  demands  a  decrease  in  B  for  fixed  time-bandwidth  product. 
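The counts quoted above can be reproduced with a short calculation, written out here as a sketch. The symbol W (the bandwidth the modulator must admit) is introduced only for this illustration, and the last line assumes that a conventional correlator devotes the whole bandwidth W to a single signal with the same detection scheme; the other quantities are those of the text, with b = B/4, $b_0 = 4B$ and $B_g = 2B$.

```latex
\begin{align*}
\text{outputs per correlation} &= \frac{\text{range}}{\text{resolution}}
   = \frac{2T_B}{1/(2b)} = 4bT_B, \\
\text{outputs for } N \text{ correlations} &= 4NbT_B = NBT_B
   \qquad (b = B/4), \\
W &= Nb_0 - B_g = 4NB - 2B \approx 4NB, \\
W\,T_B &\approx 4NBT_B, \\
\text{conventional correlator: } & 4\cdot\frac{W/2}{4}\cdot T_B
   = \frac{W\,T_B}{2} \approx 2NBT_B .
\end{align*}
```

On this reading the factor of two is the share of the modulator bandwidth taken up by guard bands when $B_g = 2B$.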

Architecture  2  -  stacked  correlator   (space  integrating) 

This architecture is equivalent to N space-integrating correlators, producing the time
functions

$$\int f_i(t+\tau)\, g_i(t-\tau)\, d\tau$$

for i = 1, ..., N, simultaneously. Figure 3 is a diagram of the processor. It is similar
to architecture 1, differing only in that L4 and F1 have been removed and that the detector
array has been replaced by a column of N fast detectors located in the back focal plane of
L3, horizontally centered. The transform with respect to x of $[f(t+x/v)]_+\,[g(t-x/v)]_+$ is
formed in the detector plane as a function of horizontal position x'. It is divided
vertically by the contributing component in f. The light amplitude seen along the
horizontal line passing through the ith detector is


$$A_i(x') = \int_{Tv} f_i(t+x/v)\, e^{j2\pi b_0 i (t+x/v)} \Big[\sum_{n=1}^{N} g_n(t-x/v)\, e^{j2\pi b_0 n (t-x/v)}\Big]\, e^{-j2\pi x' x}\, dx$$

$$= \sum_{n=1}^{N} e^{j2\pi b_0 (i+n) t} \int_{Tv} f_i(t+x/v)\, g_n(t-x/v)\, e^{j2\pi [\, b_0 (i-n)/v \,-\, x'\,]\, x}\, dx, \tag{13}$$

where T is the Bragg cell aperture $T_B$. Now,

$$A_i(0) = e^{j4\pi b_0 i t} \int_{Tv} f_i(t+x/v)\, g_i(t-x/v)\, dx, \tag{14}$$

since the other terms have no frequency components near zero and do not contribute*,
provided that

$$B_g = 2B + \Delta, \tag{15}$$

where

$$\Delta \gg 1/T. \tag{16}$$

Since the detectors are located at x' = 0, the ith detector sees

$$v^2 \left| \int f_i(t+\tau)\, g_i(t-\tau)\, d\tau \right|^2.$$


*This is an approximation, since the integration period is finite.

Architecture 3 - correlation outer product generator

Equation (13) strongly suggests an extension of the previous architecture. Rather than
evaluating $A_i(x')$ merely at x' = 0, note that for integer k,

$$A_i(k b_0 / v) = e^{j2\pi b_0 (2i-k) t} \int_{Tv} f_i(t+x/v)\, g_{i-k}(t-x/v)\, dx, \tag{17}$$

provided that $i-k \in \{1, \ldots, N\}$. Thus, a fast detector placed at $x' = k b_0 / v$ in the ith row
sees

$$v^2 \left| \int f_i(t+\tau)\, g_{i-k}(t-\tau)\, d\tau \right|^2.$$

Note that k may assume positive and negative values, so that correlations between $f_i$ and
$g_j$ for all $i, j \in \{1, \ldots, N\}$ may be calculated simultaneously.
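The way a detector position selects one component of g can be illustrated with a small numerical model of Equation 13. The sketch below is a simplification under stated assumptions: the envelopes are constants (f_i = 1 and each g_n a fixed level), the time is fixed at t = 0 so the temporal phase factor is unity, and the aperture holds a whole number of carrier periods (b_0·T an integer), so the footnoted approximation becomes exact. The function name and the parameter values are hypothetical.

```python
import cmath
import math


def row_amplitude(i, k, g_levels, b0, v, aperture_time, samples=512):
    """Aperture average of the Eq. 13 integrand at x' = k*b0/v.

    g_levels[n-1] is the constant level assumed for g_n; f_i is taken as 1.
    """
    x_prime = k * b0 / v
    length = aperture_time * v                  # aperture length T*v
    total = 0j
    for m in range(samples):
        x = m * length / samples
        for n, g_n in enumerate(g_levels, start=1):
            spatial_freq = b0 * (i - n) / v - x_prime
            total += g_n * cmath.exp(2j * math.pi * spatial_freq * x)
    return total / samples


if __name__ == "__main__":
    g_levels = [1.0, 2.0, -0.5, 0.25]           # g_1 .. g_4
    for k in (-1, 0, 1, 2):
        print(k, row_amplitude(3, k, g_levels, b0=4.0, v=1000.0, aperture_time=2.0))
```

In row i = 3, the detector at offset k responds to g_{3-k} alone, and to nothing when 3-k falls outside 1..N, which is the selection rule stated after Equation 17.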


Figure 3. Space-integrating Stacked Correlator.

Figure 4. Vector Inner Product Correlator.

Architecture  4  -  vector  inner  product  correlator 

Consider the variation of architecture 1 pictured in Figure 4. This is merely Figure 2
with F2 removed, L3 and L4 replaced by a single imaging sphere L3, and the two-dimensional
integrating  detector  replaced  by  a  one-dimensional  integrating  detector.     The  effect  of  this 
change  is  to  sum  the  contributions  from  all  components  of  f  on  the  same  detector  row. 
Reference  to  equation    (9)    shows  that  the  amplitude  of  light  striking  the  detector  array  at 
position  x  is 


$$A(t,x) = \sum_{i=1}^{N} f_i(t+x/v)\, g_i(t-x/v)\, e^{j4\pi b_0 i t}. \tag{18}$$

A pixel located at position x will integrate $|A(t,x)|^2$ over period T, producing

$$I(x) = \int_T \sum_{i=1}^{N} |f_i(t+x/v)\, g_i(t-x/v)|^2\, dt + R(x), \tag{19}$$

where

$$R(x) = \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} \int_T f_i(t+x/v)\, g_i(t-x/v)\, f_k^*(t+x/v)\, g_k^*(t-x/v)\, e^{j4\pi b_0 (i-k) t}\, dt. \tag{20}$$

If

$$B_g = 2B + \Delta, \tag{21}$$

where

$$\Delta \gg 1/T, \tag{22}$$

then R(x) is essentially zero, since none of the integrands have frequency components near
zero.


The result I(x) is an interesting function in itself. Considering $f_i(\cdot)\, g_i(\cdot)$ as the
ith element of a vector, (19) represents the average of the norm of such a vector, for each
value of x.
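The disappearance of the cross terms R(x) can be seen in a small numerical example. The sketch below assumes that each product f_i·g_i is a constant a_i over the integration window and that the window is exactly one period 1/b_0, so the cancellation is exact; with slowly varying signals and a finite window it would only be approximate. The function name and the sample values are hypothetical.

```python
import cmath
import math


def integrated_intensity(products, b0, samples=256):
    """Time average of |sum_i a_i * exp(j*4*pi*b0*i*t)|**2 over one period 1/b0.

    products[i-1] is the constant a_i standing in for f_i * g_i (Eq. 18).
    """
    period = 1.0 / b0
    total = 0.0
    for m in range(samples):
        t = m * period / samples
        amplitude = sum(a * cmath.exp(4j * math.pi * b0 * i * t)
                        for i, a in enumerate(products, start=1))
        total += abs(amplitude) ** 2
    return total / samples


if __name__ == "__main__":
    products = [0.5, -1.0 + 0.5j, 2.0, 0.25j]
    print(integrated_intensity(products, b0=4.0))
    print(sum(abs(a) ** 2 for a in products))
```

Both printed values agree to rounding error: only the |a_i|² terms of Equation 19 survive the integration, because every cross term rides on a carrier at a nonzero multiple of 2·b_0.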

The  coherent  detection  scheme  described  earlier  may  be  applied  here  to  produce  the 
integrals 

$$P(x) = \int_T \sum_{i=1}^{N} F_i(t+x/v)\, G_i^*(t-x/v)\, dt. \tag{23}$$

The integrand is the inner product between the vectors

$$\underline{F}(t+x/v) = (F_1(t+x/v),\, \ldots,\, F_N(t+x/v)) \tag{24}$$

and

$$\underline{G}(t-x/v) = (G_1(t-x/v),\, \ldots,\, G_N(t-x/v)). \tag{25}$$

Thus,

$$P(x) = \int_T \underline{F}(t+x/v) \cdot \underline{G}(t-x/v)\, dt. \tag{26}$$

Note that if the integrating detector array is replaced by a fast detector array, the norms
of the componentwise products,

$$\sum_{i=1}^{N} |f_i(t+x/v)\, g_i(t-x/v)|^2,$$

are available.


Architecture  5  -  inner  product  triple  product  processor 

Architectures  described  so  far  have  used  Bragg  cells  which  lie  in  the  same  plane.  By 
orienting  one  Bragg  cell  so  that  the  acoustic  propagation  is  vertical,   many  FDM  processors 
may  be  built    (including  ones  which  duplicate  the  function  of  previously  described 
architectures). The architectures to be described here are vector extensions of the triple
product processor (Ref. 4). One such extension is pictured in Figure 5.

A light source bearing the intensity modulation p(t) is collimated by L1 and forced to
converge vertically by L2 to illuminate cell A01. Here, the light receives modulation
[f (t-x/v) ]+,  where  x  is  the  horizontal  position  coordinate.     Cells  A01  and  A02  are  located 
in  the  front  and  back  focal  planes  of  spherical  lens  L3,   respectively.     Light  diffracted  by 
A01  illuminates  A02  and  is  modulated  by   [g(t-y/v)]+,  where  y  is  the  vertical  position  in  the 
plane  of  A02 .     The  doubly  diffracted  light  passes  through  spherical  lens  L4  and  cylindrical 
lenses  L5  and  L6  before  reaching  the  plane  of  F2 .     L4  effects  a  transform  vertically,  so 
that the vertical variation of light amplitude seen at F2 is the Fourier transform of
[g(t-y/v)]+  with  respect  to  y.     Lenses  L4  and  L5  perform  a  similar  horizontal  transform 
whose  phase  is  corrected  by  L6 .     Thus,   the  amplitude  distribution  seen  at  F2  is  the  two- 
dimensional  transform  of   [f (t-x/v) ]+  [g(t-y/v)]+  with  respect  to  x  and  y. 

As observed before, the contributions from each $f_i$ will appear horizontally disjoint,
while those from each $g_i$ will be vertically disjoint, in the plane of F2. F2 is identical
to the F2 used earlier, i.e., it admits light only through rectangles on its diagonal.
Thus, only the contributions from products of the form $f_i(t-x/v)\, g_i(t-y/v)$ pass the filter.

Light  leaving  F2  travels  the  focal  length  of  spherical  lens  L7,   passes  through  L7  and 
illuminates  a  two-dimensional  integrating  detector  located  in  the  back  focal  plane  of  L7 . 
This  effects  a  two-dimensional  transform,   causing  the  image  of   [f (t-x/v) ]+  [g(t-y/v)]+  to 
be reconstructed on the detector, but without the cross products removed by F2. The
amplitude distribution at the detector is clearly

$$A(x,y) = \sum_{i=1}^{N} e^{j 2\pi i b_0 (2t - x/v - y/v)} \sqrt{p(t)}\, f_i(t-x/v)\, g_i(t-y/v). \qquad (27)$$

The detector pixel at (x,y) integrates |A(x,y)|^2 and produces

$$I(x,y) = \int_T p(t) \sum_{i=1}^{N} |f_i(t-x/v)\, g_i(t-y/v)|^2 \, dt, \qquad (28)$$

where the non-contributing terms have been dropped, as was done earlier. Again, (28)
describes an interesting function bearing a similar interpretation to (19).

By using the coherent detection scheme, the integrals

$$P(x,y) = \int_T p(t) \sum_{i=1}^{N} F_i(t-x/v)\, G_i(t-y/v) \, dt \qquad (29)$$

may  be  generated.  Again,  the  summation  can  be  viewed  as  an  inner  product.  This  is  an 
extension  of  the  usual  triple  product  processor  output  having  the  form 


$$\int_T p(t)\, f(t-\tau_1)\, g(t-\tau_2)\, dt. \qquad (30)$$
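
A discrete-time sketch may help fix the structure of (29). The Python below is illustrative
only: the function name, the integer sample delays standing in for x/v and y/v, and the
zero-padding outside the record are assumptions, not part of the optical design. It
accumulates p(t) F_i(t-dx) G_i(t-dy) over time and over the channel index i, so that each
output pixel holds a time-integrated inner product.

```python
def inner_product_triple_product(p, F, G, delays_x, delays_y):
    """Discrete analogue of (29).

    p        : list of samples of the source modulation p(t)
    F, G     : lists of N channels, each a list of samples (F_i, G_i)
    delays_* : integer sample delays standing in for x/v and y/v (assumed)
    Returns P[a][b] = sum_t p[t] * sum_i F[i][t - dx_a] * G[i][t - dy_b].
    """
    def sample(sig, n):
        # Samples outside the record are treated as zero (assumption).
        return sig[n] if 0 <= n < len(sig) else 0.0

    out = []
    for dx in delays_x:
        row = []
        for dy in delays_y:
            acc = 0.0
            for t in range(len(p)):
                # Inner product over the channel index i at this instant.
                inner = sum(sample(Fi, t - dx) * sample(Gi, t - dy)
                            for Fi, Gi in zip(F, G))
                acc += p[t] * inner
            row.append(acc)
        out.append(row)
    return out
```

With a single channel (N = 1) the routine reduces to a sampled version of the conventional
triple product output (30).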


Architecture 6 - parallel-load triple product processor

The  final  architecture  to  be  presented  is  a  fully  degenerate  form  of  FDM  processor. 
Here,  N  is  chosen  to  be  large,    forcing  the  Bragg  cells  to  be  short.     Bragg  cell  time 
aperture  serves  only  to  distinguish  between  components  of  f  and  of  g.     The  intent  is  to 
produce  the  outputs 


$$\int_T p(t)\, f_i(t)\, g_k(t)\, dt$$

for all i, k ∈ {1, ..., N} simultaneously.

The  processor  is  a  modification  of  architecture  5,   as  shown  in  Figure  6.     As  before,  the 
transform of each product f_i(t-x/v) g_k(t-y/v) appears in a different location in the plane
of F2, with i determined by horizontal position and k determined by vertical position. The
mask F2 has been replaced by a two-dimensional integrating detector array. The pixels are
positioned such that detector (i,k) collects exposure

$$I(i,k) = \int_T p(t) \left[ \int_0^{T_B} |f_i(t-\tau)|^2 \, d\tau \right] \left[ \int_0^{T_B} |g_k(t-\tau)|^2 \, d\tau \right] dt. \qquad (31)$$


Since T_B is short, I(i,k) may be approximated by assuming f_i and g_k to be constant over any
interval of length T_B. Thus,

$$I(i,k) \approx T_B^2 \int_T p(t)\, |f_i(t-T_B/2)|^2\, |g_k(t-T_B/2)|^2 \, dt. \qquad (32)$$


The  goodness  of  the  approximation  (32)  depends  upon  B  .  As  usual,  use  of  a  coherent 
detection  scheme  allows  the  outputs 


$$P(i,k) = \int_T p(t)\, F_i(t-T_B/2)\, G_k(t-T_B/2)\, dt \qquad (33)$$


to  be  produced. 

A  comparison  of   (30)   and   (33)    shows  that  this  is  a  conventional  triple  product  processor 
allowing parallel inputs, rather than restricting all of the F_i to be time delays of a
single  function. 
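
The parallel-load output (33) can be sketched numerically as a matrix of pairwise
time-integrated products. The following Python is a minimal illustration under stated
assumptions: the function name is hypothetical, the signals are sampled lists, and the
common delay T_B/2 is omitted because it shifts every channel equally.

```python
def parallel_load_outputs(p, F, G):
    """Return P[i][k] = sum_t p[t] * F[i][t] * G[k][t] for every pair (i, k).

    p : samples of the source modulation
    F : list of N input channels F_i (each a list of samples)
    G : list of N input channels G_k (each a list of samples)
    """
    return [[sum(pt * fi * gk for pt, fi, gk in zip(p, Fi, Gk)) for Gk in G]
            for Fi in F]
```

If the rows of F are chosen to be delayed copies of one function, and likewise for G, the
matrix reproduces the sampled output of a conventional triple product processor.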

Conclusion 

This  paper  has  presented  a  representative  sample  of  processors  employing  FDM  inputs. 
The  FDM  approach  provides  a  generalization  of  conventional  correlators,    leading  to 
increased  flexibility  and  new  operations,   at  the  expense  of  some  time-bandwidth  product. 
Abstract

A  class  of  broadband,   linear  acousto-optic  filters  whose  frequency  response  can  be 
programmed  or  adapted  directly  in  the  frequency  domain  is  described.     A  practical,  compact 
architecture  for  these  broadband  linear  filters  is  presented.     Experimental  measurements  of 
the  dynamic  range  and  filter  response  functions  are  presented  and  compared  with  theory. 
Applications for interference excision, wideband recording, wideband digitizing, fast
frequency synthesis, and fast tracking superhet receivers are outlined.

Introduction 

A  variety  of  broadband  linear  filters  can  be  constructed  by  combining  acousto-optic 
modulation  with  optical  heterodyne  detection  techniques.     One  important  class  of  such  linear 
acousto-optic filters has a high-Q RF frequency response which may be programmed or adapted
directly in the frequency domain. Over the past several years, Probe Systems has examined
the architectures, theoretical performance, measured performance and systems applications
for this class of linear acousto-optic filters. The results of this effort are summarized
below and indicate that the technology is extremely attractive for high performance
programmable and adaptive filtering of broadband signals.

The  functional  concept  of  a  programmable  linear  acousto-optic  filter  is  shown  in  Figure  1. 
The filter accepts an electronic RF input s(t) and produces an electronic RF output
s(t) * h(t), where * denotes convolution. The filter impulse response h(t) or frequency response H(f) may be programmed
by  an  external  electronic  interface.     For  the  class  of  programmable  linear  acousto-optic 
filters  considered  here,   the  filter  magnitude  response   |H(f)|    can  be  programmed  directly  as 
a  function  of  frequency  f. 

The adaptive linear acousto-optic filter is similar to the programmable linear acousto-optic
filter except that the filter contains a means of measuring parameters of the RF input
or RF output signal. These parameter measurements are used to adjust the filter response to
provide optimum performance for the particular signal environment and post-filter processing.
For the class of adaptive linear acousto-optic filters discussed here, the signal parameter
measurements are restricted to power spectrum measurements which in turn are used to adjust
the filter magnitude response |H(f)|.

Filter  architecture  and  models 

The  basic  architecture  which  has  been  developed  for  linear  acousto-optic  filtering  with 
frequency domain control is shown in Figure 2. This architecture uses an optical layout
which is similar to, yet distinctly different from, the layout of an acousto-optic spectrum
analyzer. Each frequency of the RF input creates a translating grating in the Bragg cell.
Each grating in turn generates a diffracted optical signal beam which is focused to a spot
in the transform plane of the processor. The spatial position of each spot in the transform
plane is proportional to the RF input frequency. Also, each diffracted spot has an optical
Doppler shift which is equal to the corresponding RF input frequency.
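
The proportionality between spot position and RF frequency can be illustrated with the usual
small-angle Bragg relation, in which the deflection angle is approximately the optical
wavelength times the RF frequency divided by the acoustic velocity. The Python sketch below
is not taken from the paper: the function name and the numerical values in the usage note
(acoustic velocity, transform lens focal length) are assumed purely for illustration.

```python
def spot_position_m(freq_hz, wavelength_m, acoustic_velocity_m_s, focal_length_m):
    """Transform-plane offset of the diffracted spot from the undiffracted beam.

    Small-angle approximation: deflection ~ wavelength * f / v, and the
    transform lens maps an angle to a position of focal_length * angle.
    """
    deflection_rad = wavelength_m * freq_hz / acoustic_velocity_m_s
    return focal_length_m * deflection_rad
```

For example, with a 632.8 nm HeNe line, an assumed acoustic velocity of 4200 m/s and an
assumed 0.2 m focal length, a 35 MHz input lands roughly 1.05 mm from the undiffracted beam,
and doubling the frequency doubles the offset.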

The processor of Figure 2 is linear as a consequence of employing optical heterodyne
photodetection. This heterodyne detection is achieved by including a weakly diffracting
hologram in the optical path as shown in Figure 2. The hologram diffracts a small portion
of the undiffracted optical beam to create an optical reference beam which is not Doppler
shifted. The sum of this holographic reference beam and the acousto-optically diffracted
signal beam is detected by a wideband, spatially-integrating output photodetector.

The square-law photodetection produces a heterodyne output signal which oscillates at
the RF input frequency as a consequence of the relative Doppler shift between the
diffracted signal and reference beams. More generally, this heterodyne RF output signal
must be a linearly filtered version of the RF input signal for low-level, acousto-optic
modulation.1 Assuming in Figure 2 that the transform-plane spatial light modulator is
removed, the linear frequency response from the RF input to the RF output is determined
primarily by the amplitude and phase curvature of the holographic reference beam. By
synthesizing an appropriate hologram, one can create a frequency response having a flat
magnitude and linear phase over the RF input bandwidth. This frequency response corresponds
to a fixed broadband time delay. Successful fabrication of holograms to generate the fixed,
broadband time delay response has been performed in previous work.2,3

The  fixed,  broadband  time  delay  filter  can  be  converted  to  a  programmable  or  adaptive 
filter  by  incorporating  a  transform-plane  spatial  light  modulator  as  shown  in  Figure  2. 
The spatial light modulator allows one to optically block specific frequencies from
reaching the output photodetector. The blocked frequencies result in stopbands in the
linear frequency response of the filter so that high-Q notch and passband filter functions
can be created.

With  an  electronically  programmable  spatial  light  modulator  in  the  transform  plane,  one 
obtains  a  channelized  programmable  filter.     A  model  for  this  type  of  programmable  filter  is 
shown  in  Figure  3.     The  model  shows  that  the  RF  input  is  divided  into  N  passband  filter 
channels  whose  outputs  can  be  individually  switched  for  addition  to  the  RF  output.     A  key 
property  of  the  filter  bank  is  that  adjacent  filter  channels  are  contiguous  in  the  sense 
that  adding  adjacent  channel  outputs  will  result  in  a  wider  passband  filter  response  with 
no  filter  gaps  or  phase  distortion.     Also,   individual  filter  channels  have  essentially  no 
phase  distortion  even  at  the  passband  edges  if  the  filter  time  delay  is  appropriately 
chosen.1,4 The number of potentially resolvable filter channels is approximately equal to
the  time-bandwidth  product  of  the  apodized  Bragg  cell.     The  actual  number  N  of  filter 
channels  is  determined  by  the  number  of  spatial  light  modulator  elements. 
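
The channelized model of Figure 3 can be imitated numerically by grouping DFT bins into N
contiguous channels and switching each group on or off. The Python sketch below is an
idealized digital stand-in, not the optical processor itself: the function names, the
brick-wall channel edges and the use of a plain DFT are all assumptions made for
illustration.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (adequate for short records)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def channelized_filter(x, switches):
    """Pass only the channels whose switch is True.

    The band from DC to half the sample rate is split into len(switches)
    contiguous channels; negative-frequency bins follow their positive mirror
    so that a real input gives a real output.
    """
    n = len(x)
    n_ch = len(switches)
    half = n // 2
    X = dft(x)
    Y = [0j] * n
    for k in range(n):
        f = k if k <= half else n - k            # fold onto positive frequency
        ch = min(f * n_ch // (half + 1), n_ch - 1)
        if switches[ch]:
            Y[k] = X[k]
    return [y.real for y in idft(Y)]
```

Because adjacent channels share no bins and leave no gaps, switching on neighbouring channels
yields a wider passband with no additional phase distortion, which mirrors the contiguity
property described above.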

As opposed to using a programmable spatial light modulator in the transform plane, one
may instead use a "self-adaptive" spatial light modulator whose optical intensity
transmission is a function of the incident optical intensity. This results in an adaptive filter
whose approximate model is shown in Figure 4. In this model, the RF input is divided into
a number N of filter channels where the gain in each channel is a function of the average
signal power in that channel. The channel outputs are then summed to provide the RF output.
In this manner, one can create a broadband processor with "channelized" automatic gain
control.

As one important example, the adaptive, channelized gain control may be used to reject
strong narrowband interference signals from broadband signals of interest. This requires
an adaptive spatial light modulator which functions as an "optical inverter." The optical
inverter is transparent for low-level optical intensities but becomes optically opaque for
high-level optical intensities. This results in a channelized adaptive filter whose
channel gain is fixed for low-level signals but drops to zero when the channel signal power
exceeds a set threshold. The resultant filter adaptively rejects strong narrowband
interference from the broadband signal of interest with minimal broadband signal distortion.
Such narrowband interference "excision" filters have received considerable development
attention1-9 and have great utility in the processing of wideband spread spectrum signals
which are subject to narrowband interference jamming.
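
The threshold behaviour of the "optical inverter" can be mimicked in a few lines. The Python
sketch below is a digital analogy under stated assumptions: each DFT bin plays the role of one
channel, the threshold is applied to the squared bin magnitude, and the function names are
hypothetical rather than drawn from the paper.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (adequate for short records)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def excise_narrowband(x, power_threshold):
    """Zero every channel whose power exceeds the threshold; pass the rest.

    Returns the filtered real signal and the sorted list of blocked bins.
    """
    X = dft(x)
    blocked = {k for k, Xk in enumerate(X) if abs(Xk) ** 2 > power_threshold}
    Y = [0j if k in blocked else Xk for k, Xk in enumerate(X)]
    return [y.real for y in idft(Y)], sorted(blocked)
```

Applied to a +/-1 chip sequence plus a much stronger tone, only the two bins holding the tone
exceed a suitably chosen threshold, so the jammer is removed while the broadband signal loses
only the small fraction of its energy that fell in those bins.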

As  a  final  note,   the  holographic  interferometer  architecture  of  Figure  2  is  a  major 
advance  over  previously  used  Mach-Zehnder  interferometer  architectures9-11   for  obtaining 
stable  and  compact  linear  acousto-optic  filters.     The  Mach-Zehnder  architectures  are  subject 
to  vibrational  phase  noise  in  the  RF  output  due  to  the  physically  separate  optical  paths  of 
the  signal  and  reference  beams.     The  holographic  architecture  of  Figure  2  employs  a  common 
optical  path  for  the  signal  and  reference  beams  so  that  vibrational  phase  noise  is  avoided. 
Also,   the  single  optical  path  of  the  holographic  architecture  results  in  a  simple,  compact 
processor  so  that  practical  field  units  can  be  built. 

Filter  Rejection  Depth 

The  relative  filter  depth  between  the  pass  and  stop  bands  of  the  linear  acousto-optic 
filter  is  determined  by  a  variety  of  factors.     For  good  overall  filter  depth,   it  is 
desirable  to  apodize  the  Bragg  cell  aperture  such  that  its  Fourier  transform  results  in  a 
low-sidelobe optical spot in the transform plane. The absence of sidelobes prevents
spillover of signal energy into adjacent frequency channels. This signal energy spillover will
in general reduce the filter depth.

A well apodized Bragg cell does not guarantee good filter depth, however. Optical
distortions and scattering from the optics can reduce the filter depth. Broad laser linewidths
and spontaneous emission can also reduce the filter depth and frequency resolution.
Finally, the overall filter depth cannot exceed 20 log10(c) dB where c is the optical
intensity contrast ratio of the transform-plane spatial light modulator.1
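
As a quick numerical check of the contrast-ratio bound, the following Python helper (the
function name is an assumption made for illustration) evaluates 20 log10(c) for a given
optical intensity contrast ratio c.

```python
import math


def max_filter_depth_db(intensity_contrast_ratio):
    """Upper bound on filter depth, 20*log10(c), for intensity contrast ratio c."""
    return 20.0 * math.log10(intensity_contrast_ratio)
```

Contrast ratios of 100 and 1000 give bounds of 40 dB and 60 dB respectively.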

Experiments  were  performed  to  measure  the  filter  depth  for  notch  and  passband  filter 
functions  of  the  linear  acousto-optic  filter.     The  particular  experimental  results  to  be 
shown  here  utilized  a  helium  neon  laser  with  a  Mach-Zehnder  interferometer  architecture. 
The  notch  filter  was  implemented  by  placing  a  thin  wire  in  the  transform  plane  of  the 
linear  acousto-optic  filter.     The  passband  filter  was  implemented  by  using  a  razor  blade 
aperture in the transform plane. The wire and razor blade apertures effectively simulate
ideal,   infinite-contrast-ratio  spatial  light  modulators. 

The  notch  filter  experiments  were  performed  using  a  Bragg  cell  having  20  megahertz  of 
bandwidth  and  a  full  aperture  time-bandwidth  product  of  125.     The  Bragg  cell  aperture 
was  apodized  to  a  nearly  gaussian  shape  using  the  natural  laser  beam  profile.  The 
measured  filter  magnitude  response  for  the  notch  filter  is  shown  in  Figures  5A  and  5B. 
The  filter  response  is  seen  to  be  fairly  flat  over  the  20  megahertz  bandwidth  of  the  Bragg 
cell  except  for  the  sharp  notch  filter  at  35  megahertz.     The  notch  filter  depth  varies 
from  40  to  48  dB  at  the  notch  center.     Experiments  with  slightly  wider  notches  demonstrated 
at  least  40  dB  of  filter  rejection  within  the  notch.     The  width  of  the  transform-plane  wire 
for the results shown in Figures 5A and 5B corresponds to 1/33 of the net 20 megahertz
bandwidth.
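
The notch width quoted above can be put beside the resolution implied by the time-bandwidth
product. The Python sketch below is only a rough worked calculation: it assumes that one
resolvable channel is about the bandwidth divided by the full-aperture time-bandwidth
product, which the gaussian apodization will degrade somewhat, and the function name is
hypothetical.

```python
def channel_width_hz(bandwidth_hz, time_bandwidth_product):
    """Approximate width of one resolvable channel (unapodized estimate)."""
    return bandwidth_hz / time_bandwidth_product


resolution_hz = channel_width_hz(20e6, 125)    # about 160 kHz per channel
wire_width_hz = 20e6 / 33                      # about 606 kHz blocked by the wire
cells_blocked = wire_width_hz / resolution_hz  # roughly 3.8 resolution cells
```

On this estimate the wire spans a little under four resolution cells of the 20 megahertz band.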

The  bandpass  filter  experiments  were  performed  using  a  Bragg  cell  having  45  megahertz 
of  bandwidth  and  a  full  aperture  time-bandwidth  product  of  400.     As  before,   the  Bragg  cell 
aperture  was  apodized  to  a  nearly  gaussian  shape.     The  measured  magnitude  response  for  the 
full  bandwidth  filter  is  shown  in  Figure  6A.     One  can  see  that  the  full  bandwidth  filter 
response  is  reasonably  flat  over  45  megahertz  of  bandwidth.     By  placing  a  razor  blade 
aperture  in  the  transform  plane,   the  narrow  passband  filter  response  of  Figure  6B  was 
obtained.     This  passband  filter  response  shows  approximately  50  dB  of  rejection  near  the 
filter  edges  with  nearly  60  dB  of  rejection  farther  from  the  filter  edges. 

In these and other experimental measurements, the passband filters showed significantly
better filter rejection than the notch filters. As a consequence, optical scattering was
suspected  as  being  a  major  limitation  on  filter  depth  since  passband  filter  apertures 
allow  less  optical  scattering  to  reach  the  photodetector  than  notch  filter  apertures. 
While  the  filter  depth  results  obtained  here  are  impressive,   even  higher  filter  depth 
results  should  be  attainable  as  sources  of  optical  scatter  are  identified  and  minimized. 

The filter depth measurements of Figures 5 and 6 demonstrate the currently achievable
filter depth for ideal, infinite-contrast-ratio spatial light modulators. A variety of
spatial light modulator technologies have demonstrated optical intensity contrast ratios
in the range of 100-1000, corresponding to filter depths in the range of 40-60 dB. Also,
a number of experimental results have previously been reported for adaptive5,6 and
programmable9 spatial light modulators with encouraging results. It is reasonable to expect
that programmable and adaptive spatial light modulators obtained from modifications of
current technology will in the near future provide the filter depths indicated in Figures 5
and 6.

Dynamic  Range 

When  viewed  as  a  module,   the  linear  acousto-optic  filter  is  very  similar  to  a  broadband 
electronic  amplifier  combined  with  a  high-Q  filter  network.     Like  the  electronic  amplifier, 
the  linear  acousto-optic  filter  has  sources  of  internal  noise  as  well  as  limits  on  the  RF 
signal  level  so  that  its  dynamic  range  is  limited.     In  analogy  to  an  electronic  amplifier 
with  an  octave  or  less  of  signal  bandwidth,   the  dynamic  range  of  a  linear  acousto-optic 
filter  can  be  described  by  specifying  the  relative  output  power  levels  of  the  fundamental 
signal, the noise, and the two-tone, third-order intermodulation.

The output signal-to-noise ratio for linear acousto-optic filters has been expressed and
optimized in previous work.1 Figure 7 shows the theoretical signal, wideband noise and
intermodulation power levels for such an optimized linear acousto-optic filter. This
theoretical data assumes a net filter bandwidth of 45 megahertz, a Bragg cell
time-bandwidth product of 400, a 7 milliwatt HeNe laser, and a PIN photodiode detector with a
50% quantum efficiency. The saturated signal level is assumed to be determined by input
power limits to the Bragg cell such that the 1 dB compression point occurs at a diffraction
efficiency of 10%. (In actual practice, the Bragg cell saturates at a much higher level.)

From Figure 7, one can see that the theoretical 1 dB compression point for the linear
acousto-optic filter occurs at a "relative" output power of 45.5 dB. The net noise power
increases  for  signals  near  the  1  dB  compression  power  level  due  to  signal  dependent 
optical  shot  noise.     As  a  result,   the  theoretical  output  signal-to-noise  ratio  at  the  1  dB 
compression  point  is  approximately  43.5  dB. 

The  theoretical  two-tone,   third-order  intermodulation  power  for  the  linear  acousto-optic 
filter  is  also  plotted  in  Figure  7.     This  intermodulation  is  assumed  to  be  caused  by  the 
acousto-optic diffraction process of the Bragg cell. The third order intercept point
occurs at a power level of 64.5 dB relative to the noise floor so that the theoretical
spurious free dynamic range of the filter is (2/3) x 64.5 = 43 dB.12 The spurious free
dynamic  range  represents  the  maximum  output  signal-to-noise  ratio  for  which  the  two-tone, 
third-order  intermodulation  products  do  not  exceed  the  net  broadband  noise  power. 
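
The two-thirds factor follows from the standard behaviour of third-order products, which rise
3 dB for every 1 dB of signal and meet the fundamental at the intercept point. The Python
sketch below restates that arithmetic with all levels in dB relative to the noise floor; the
function names are illustrative assumptions.

```python
def im3_level_db(signal_db, intercept_db):
    """Two-tone third-order product level for a given signal level."""
    return 3.0 * signal_db - 2.0 * intercept_db


def spurious_free_dynamic_range_db(intercept_db):
    """Signal level at which the third-order product just reaches the noise floor."""
    return (2.0 / 3.0) * intercept_db
```

An intercept of 64.5 dB gives a spurious free dynamic range of 43 dB, and an intercept of
54 dB gives 36 dB.
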
The theoretical dynamic range data of Figure 7 can be compared with the actual
experimental  dynamic  range  data  of  Figure  8.     The  experimental  data  of  Figure  8  was 
collected  using  the  same  linear  acousto-optic  filter  having  the  full  45  megahertz  of 
bandwidth  shown  in  Figure  6A.     The  conditions  for  the  experimental  data  of  Figure  8  are 
essentially  the  same  as  the  conditions  for  the  theoretical  data  of  Figure  7  except  that 
an  avalanche  photodiode  detector  is  used  rather  than  a  PIN  photodiode  detector.     As  one 
consequence,  the  saturated  signal  level  for  the  experimental  results  is  determined  by  the 
input power limits of the avalanche photodiode rather than the Bragg cell. Another
consequence of using the avalanche photodiode rather than a PIN photodiode is excess shot
noise.

The experimental dynamic range data of Figure 8 shows an output signal-to-noise ratio
at the 1 dB compression point of 39 - 1 = 38 dB. This is 5.5 dB lower than the optimized
theoretical value of 43.5 dB. The lower signal-to-noise ratio for the experimental results
can be attributed to factors such as excess shot noise from the avalanche photodetector and
optical reflection losses. The third-order intercept point for the experimental data of
Figure 8 is at 54 dB relative to the wideband noise power. This represents a spurious
free dynamic range of 36 dB which is 7 dB lower than the theoretical value. The measured
spurious free dynamic range is worse than one would expect from the nonlinearities of the
acousto-optic diffraction. The excess third-order intermodulation was caused by
nonlinearities in the input and output amplifier electronics and should be removable by proper
system design.

The experimental results were in reasonably close agreement with the theoretical results
when the experimental deficiencies are accounted for. The experimental dynamic range can
be expected to match the theoretical dynamic range predictions more closely when individual
components are optimized. For systems requiring higher dynamic range than presented here,
one can either use a laser having a higher optical power P or sacrifice filter resolution
by using a lower time-bandwidth product TB for filtering. The maximum output
signal-to-noise ratio will increase in proportion to P/(TB).1
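
The P/(TB) scaling can be turned into a simple trade-off calculator. The Python sketch below
assumes that the signal-to-noise ratio is expressed as a power ratio, so that the change in
dB is ten times the base-10 logarithm of the scaling factor; the function name and its
arguments are hypothetical.

```python
import math


def snr_change_db(power_ratio, tb_ratio):
    """Change in maximum output SNR when laser power is scaled by power_ratio
    and the time-bandwidth product by tb_ratio (SNR proportional to P/TB)."""
    return 10.0 * math.log10(power_ratio / tb_ratio)
```

Under this assumption, doubling the laser power or halving the time-bandwidth product each
buys about 3 dB.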

Applications 

The  linear  acousto-optic  filters  described  above  have  a  number  of  important  advantages 
over  other  technologies  for  wideband  programmable  and  adaptive  filtering  applications. 
The  linear  acousto-optic  filter  technology  is  capable  of  much  wider  bandwidths  than 
currently  available  digital  or  CCD  filters.     Also,   the  linear  acousto-optic  filters  avoid 
capacitive  loading  and  coupling  problems  associated  with  the  electronic  switch  arrays  used 
in  CCD  and  SAW  tapped  delay  line  filters.     When  compared  to  SAW  chirp  filters,   the  linear 
acousto-optic  filters  offer  superior  phase  linearity,   avoid  timing  and  synchronization 
problems,   provide  simultaneous  I  and  Q  channel  filtering,   and  do  not  generate  wideband 
spurious  signals13   from  the  filter  sidelobe  energy. 

One  of  the  important  applications  for  linear  acousto-optic  filters  involves  the  adaptive 
rejection  of  narrowband  interference  signals  from  broadband  communications  and  radar 
signals.     As  described  above,   this  type  of  adaptive  filter  can  be  implemented  by  using  an 
"optical  inverter"  spatial  light  modulator  in  the  processor  transform  plane  to  block  the 
high-intensity, narrowband signals while passing the broadband signals. A typical
application for narrowband interference rejection in spread spectrum communications is shown in
Figure 9. The adaptive linear acousto-optic filter precedes the spread spectrum receiver
to reject the narrowband jammer signals. The notch filters created by the linear
acousto-optic filter are extremely narrow with essentially no phase distortion so that the
broadband signal undergoes a minimal amount of distortion. Rejection of the narrowband jammers
provides improved receiver detection and synchronization sensitivity.8,14

Another  important  application  area  for  linear  acousto-optic  filters  is  in  wideband 
signal  recording  or  digitizing.     The  dynamic  range  of  analog  signal  recorders  or  digitizers 
tends  to  decrease  as  the  bandwidth  increases.     To  compensate  for  the  limited  dynamic  range, 
a  programmable  or  adaptive  linear  acousto-optic  filter  can  precede  the  recorder  or  digitizer 
as  shown  in  Figure  10.     An  adaptive  linear  acousto-optic  filter  could  provide  frequency 
channelized automatic gain control to ensure that the signals being collected are within
the dynamic range limits of the recorder or digitizer. The frequency channelized gain
control allows one to simultaneously record high-level and low-level signals without
broadband intermodulation. Alternatively, one may utilize a programmable linear acousto-optic
filter  to  match  the  receive  filter  bandwidth  and  center  frequency  to  the  desired  signal 
while  also  providing  narrowband  interference  rejection.     An  additional  benefit  of  the  linear 
acousto-optic  filter  is  that  the  extremely  sharp,   linear-phase  filter  skirts  provide  ideal 
antialias  filtering  for  the  digitizer  and  allow  one  to  digitize  signal  bandwidths  near  the 
theoretical  limit  of  one-half  the  sample  rate. 
The programmable linear acousto-optic filter can be used as a system building block in
advanced  spread  spectrum  communications  and  intercept  systems.     Figure  11  shows  such  an 
example  for  fast  hopping  frequency  synthesis.     The  programmable  filter  is  driven  with  an 
impulse  train  which  is  phase  locked  to  a  stable  crystal  source.     In  the  frequency  domain, 
the  impulse  train  consists  of  a  set  of  equally-spaced  CW  tones  with  equal  amplitudes  and 
phases.     The  programmable  filter  allows  one  to  create  a  bandpass  filter  to  select  one  of 
the CW tones while rejecting the other tones. Assuming the requisite high-speed
programmable optical switch can be constructed, this method of direct frequency synthesis can
simultaneously provide rapid frequency hopping and frequency stability. This would be a
major advance over conventional phase-locked loop synthesis techniques whose programming
speed is inversely proportional to the settling time and phase noise of the generated CW
tones.15
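
The comb-selection idea can be checked with a short numerical experiment. The Python sketch
below is a sampled analogy rather than a model of the optical hardware: the function names,
the record length and the ideal brick-wall bandpass are assumptions. It forms a periodic
impulse train, whose DFT is a set of equally spaced, equal-amplitude lines, and keeps one
line together with its negative-frequency mirror.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (adequate for short records)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def select_tone(n_samples, pulse_period, tone_bin):
    """Filter a periodic impulse train down to a single CW tone.

    tone_bin should be a multiple of n_samples // pulse_period, since only
    those bins carry a comb line.
    """
    comb = [1.0 if t % pulse_period == 0 else 0.0 for t in range(n_samples)]
    spectrum = dft(comb)
    passband = {tone_bin % n_samples, (-tone_bin) % n_samples}
    filtered = [spectrum[k] if k in passband else 0j for k in range(n_samples)]
    return [v.real for v in idft(filtered)]
```

Reprogramming the passband to a different comb line changes the output frequency at once,
with no loop to settle, which is the property the text contrasts with phase-locked loop
synthesis.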


Another  example  of  using  the  programmable  linear  acousto-optic  filter  as  a  system 
building  block  is  shown  in  Figure  12.     This  figure  shows  the  implementation  of  a  fast 
tracking superhet receiver. The local oscillator (LO) for the superhet receiver is
generated using the fast frequency synthesis techniques previously shown in Figure 11. A
programmable linear acousto-optic filter is also used as a tuning or image rejection filter
on the RF input. The frequency skirts of this tuning filter are so sharp and deep that an
intermediate-frequency (IF) filter is not required. Electronic mixing of the filtered RF with the
selected  local  oscillator  frequency  converts  the  signal  directly  to  baseband.     The  tracking 
rate  of  this  superhet  receiver  is  potentially  much  greater  than  for  conventional  superhet 
receivers  whose  tracking  speed  is  limited  by  the  bandwidth  of  the  IF  filter  as  well  as  the 
local  oscillator  settling  time. 


Conclusions 


The  linear  acousto-optic  filter  technology  has  demonstrated  extremely  high  resolution 
filtering  with  excellent  filter  depth  and  dynamic  range.     The  technology  is  very  attractive 
for wideband programmable and adaptive filtering in a number of important systems
applications. Appropriate spatial light modulator development should result in broadband
programmable and adaptive field demonstration units in the near future.
Figure 1. The functional concept of a programmable linear acousto-optic filter.

Figure 2. Linear acousto-optic filter architecture.

Figure 5. Notch filter frequency response for a linear acousto-optic filter using nearly
gaussian apodization: (a) magnitude response, 30 to 50 MHz; (b) magnitude response, expanded
view, 33 to 37 MHz. (Reprinted from Ref. 4, courtesy of Marcel Dekker, Inc.)

Figure 6. Full bandwidth and narrow passband frequency responses for a linear acousto-optic
filter: (a) magnitude response, full bandwidth; (b) magnitude response, narrow passband.
Frequency axis in both panels: 50 to 90 MHz.

Figure 7. Theoretical dynamic range for a 45-MHz bandwidth linear acousto-optic filter
(7 mW HeNe laser, TB = 400).

Figure 8. Measured dynamic range for a 45-MHz bandwidth linear acousto-optic filter
(7 mW HeNe laser, TB = 400).
ABSTRACT

An  iterative  optical  matrix-vector  processor  is  described  and  its  use  in  computing  the 
adaptive  weights  for  a  multi-dimensional  phased  array  radar  is  detailed.     Experimental  data 
is provided and the accuracy of the processor is discussed. A systolic version of the
processor requiring only a 1-D acousto-optic transducer is also described.

1.  INTRODUCTION 


Adaptive  phased  array  radar    (APAR)    represents  a  formidable  signal  processing  problem  [1-3] 
of  considerable  current  interest   [4-5]    and  for  which  advanced  signal  processing  systems  such 
as optical processors appear quite appropriate. The real-time and parallel processing
features of optical processors make such systems attractive candidates for this application.
However,   since  the  processing  in  an  adaptive  radar  requires  matrix  inversions  and  similar 
linear  algebraic  operations,   a  new  type  of  optical  processor  that  is  more  general  than  the 
conventional  optical  Fourier  transform  and  correlation  systems  is  necessary.     In  this  paper, 
we  describe  a  new  and  general-purpose  optical  processor  and  we  demonstrate  and  discuss  its 
use  and  performance  as  an  adaptive  radar  processor. 

In Section 2, we describe the processing required for APAR. We then describe (Section 3) an iterative matrix-vector optical processor we have fabricated. In Section 4, we provide initial experimental results obtained on our laboratory iterative optical system. This includes the use of this system in calculating the adaptive weights necessary to cancel noise sources distributed in angle and to achieve multi-dimensional adaptive antenna processing with interference sources distributed in both time and space. The accuracy and performance of the system are addressed in Section 5 and our summary and conclusions are advanced in Section 6. In Appendix A, we show that the same matrix-vector equation results whether the antenna's output SNR is maximized or whether the mean-square error between the signal and the array output is minimized. In Appendix B, we describe an optical systolic-array processor architecture that can also solve simultaneous linear equations. This system [6] is attractive since it only requires a 1-D AO transducer rather than a real-time and reusable 2-D spatial light modulator.

2.      APAR  PROCESSING 

For simplicity, we consider a linear phased array antenna with N isotropic elements spaced $d = \lambda_R/2$ apart (see Figure 1). We assume a signal $s(t)\exp(j\omega t)$ in the far field at an angle $\theta_0$ together with M uncorrelated, zero-mean, narrow-band interference sources $r_m(t)\exp(j\omega t)$ at angles $\theta_m$. The objective of an APAR is to maximize the antenna's response in the direction $\theta_0$ and to minimize its response in the directions $\theta_m$ of the interference sources. Since all sources are in the far field, and since the path difference between signals received at two adjacent antenna elements due to a target at an angle $\theta$ is $d\sin\theta$, the signal received at the n-th antenna element is

$z_n(t) = s(t)\,e^{j(\omega t + \pi n \sin\theta_0)} + \sum_{m=1}^{M} r_m(t)\,e^{j(\omega t + \pi n \sin\theta_m)}$. (1)

To achieve the desired antenna response, these N antenna outputs are multiplied by a set of complex weights $w_n$ to give a final receiver output

$v_{out}(t) = \sum_{n=0}^{N-1} w_n z_n(t) = \mathbf{w}^T \mathbf{z}(t)$. (2)

In (2) and in future discussions, we use matrix-vector notation to describe the received signals z, their covariance matrix M, and the set of weights w.

With no interference, we can steer the antenna to $\theta = \theta_0$ by applying the conjugate phase pattern $w_n = \exp(-j\pi n \sin\theta_0)$. However, when directional interference is present, the optimal weighting is not this simple and moreover the weights must be calculated adaptively as a function of changes in the RF environment. In Appendix A, we show that the adaptive weight vector w that maximizes performance satisfies the matrix-vector equation

$\mathbf{M}\mathbf{w} = \mathbf{s}^*$, (3a)

that is

$\mathbf{w} = \mathbf{M}^{-1}\mathbf{s}^*$, (3b)

where $\mathbf{M} = \overline{\mathbf{z}^*(t)\,\mathbf{z}^T(t)}$ is the covariance matrix, $P_0 = \overline{s^*(t)\,s(t)}$ is the signal power and $\mathbf{s} = \{\exp(j\pi n \sin\theta_0)\}$ is the steering vector.

FIGURE 1 Simplified pictorial block diagram of an adaptive phased array radar processor.
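The weight computation in (3a) and (3b) can be illustrated numerically. The following is a minimal sketch, not the processing used in the paper: it assumes half-wavelength element spacing, a covariance matrix built from unit receiver noise plus uncorrelated interference terms, and hypothetical helper names (`steering_vector`, `adaptive_weights`, `array_response`) chosen only for illustration.

```python
import numpy as np

def steering_vector(n_elements, theta_rad):
    """Elements exp(j*pi*n*sin(theta)) for an array with half-wavelength spacing."""
    n = np.arange(n_elements)
    return np.exp(1j * np.pi * n * np.sin(theta_rad))

def adaptive_weights(n_elements, theta0, jammers, noise_power=1.0):
    """Solve M w = s* with M = noise*I + sum_m P_m conj(s_m) s_m^T (assumed model)."""
    M = noise_power * np.eye(n_elements, dtype=complex)
    for theta_m, power_m in jammers:
        s_m = steering_vector(n_elements, theta_m)
        M += power_m * np.outer(s_m.conj(), s_m)  # outer() does not conjugate
    s0 = steering_vector(n_elements, theta0)
    w = np.linalg.solve(M, s0.conj())
    return w, M, s0

def array_response(w, theta_rad):
    """Magnitude of w^T s(theta), the array output for a unit source at theta."""
    return abs(w @ steering_vector(len(w), theta_rad))

# Illustrative values only: 8 elements, target at 45 degrees, one jammer at 0 degrees.
w, M, s0 = adaptive_weights(8, np.deg2rad(45.0), [(0.0, 100.0)])
target_gain = array_response(w, np.deg2rad(45.0))
jammer_gain = array_response(w, 0.0)
print(f"target response {target_gain:.3f}, jammer response {jammer_gain:.5f}")
```

In this sketch the response toward the jammer is driven far below the response toward the target, which is the behaviour the adaptive weights are meant to produce.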

To extend this theory to adaptivity in velocity (or time) in addition to angle (or space), we use N' time taps on each of the N adaptive antenna elements. When the proper set of N x N' weights $w_{n,n'}$ is applied to the $z_{n,n'}$ received signals at these N x N' taps, adaptivity in space and time results. The analysis follows the simpler 1-D angular adaptivity case described above. The received signal at the antenna element (n,n') is now

$z_{n,n'}(t) = s(t)\,e^{j[\omega t + \pi n \sin\theta_0 + \pi(4\tau/\lambda_R) n' v_0]} + \sum_{m=1}^{M} r_m(t)\,e^{j[\omega t + \pi n \sin\theta_m + \pi(4\tau/\lambda_R) n' v_m]}$, (4)

where $-\pi/2 < \theta < \pi/2$ and $-(4\tau/\lambda_R) < v < (4\tau/\lambda_R)$, $\tau$ is the time-delay per tap and v is velocity. The 2-D steering vector is

$s^*_{k,k'} = e^{-j\pi[k \sin\theta_0 + (4\tau/\lambda_R) k' v_0]}$ (5)

and the elements of the new covariance matrix (in terms of $z_{n,n'}$) and the new adaptive weights satisfy

$s^*_{k,k'} = \sum_{n=0}^{N-1} \sum_{n'=0}^{N'-1} M_{k,k',n,n'}\, w_{n,n'}$. (6)

One can solve the matrix-matrix equation in (6), or one can convert this 2-D adaptivity problem into a matrix-vector problem by lexicographically ordering the received signals into a new vector z(t) with a corresponding covariance matrix denoted by M and a new steering vector and weights described by s and w. In this case (6) becomes

$\mathbf{M}\mathbf{w} = \mathbf{s}^*$. (7)

3.      ITERATIVE  OPTICAL  PROCESSOR 

As shown in Section 2, the processor for APAR must be able to perform matrix-vector multiplications, solve matrix-vector equations and invert matrices. Thus, a new and general-purpose optical architecture is necessary if the high speed and parallel processing features of optical systems are to be used for this application. The system of Figure 2 achieves the required processing. The vector output x from a linear LED input array at P1 is imaged vertically and expanded horizontally to illuminate a mask H at P2. The light distribution leaving each column at P2 is then summed on separate detector elements of a linear photodetector (PD) array at P3. This P3 output is the matrix-vector product Hx. This is subtracted from an external vector y, the difference is multiplied by an acceleration parameter $\omega$ and added to the prior input to produce the next iterative input. We describe this iterative optical matrix-vector processor by

$\mathbf{x}(j+1) = \mathbf{x}(j) + \omega[\mathbf{y} - \mathbf{H}\mathbf{x}(j)]$, (8)

where j denotes the iteration number. When x(j) = x(j + 1) = x, (8) reduces to

$\mathbf{H}\mathbf{x} = \mathbf{y}$ (9)

and the system's output is the solution

$\mathbf{x} = \mathbf{H}^{-1}\mathbf{y}$ (10)

of the matrix-vector equation in (9).
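The iteration can be sketched in a few lines of Python. This is a purely digital illustration, not the optical system: it follows the sign convention of the prose (the product Hx is subtracted from y, and the scaled difference is added to the previous input), and the 2 x 2 matrix, the vector and the function name `iterate_solve` are made-up illustrative choices. The acceleration parameter is taken as the reciprocal of the Euclidean (Frobenius) norm of H, as is done later in the paper's example.

```python
import numpy as np

def iterate_solve(H, y, omega, n_iter):
    """Repeat x <- x + omega * (y - H x), starting from x = 0."""
    x = np.zeros_like(y, dtype=float)
    history = []
    for _ in range(n_iter):
        x = x + omega * (y - H @ x)
        history.append(x.copy())
    return x, history

# Illustrative, symmetric positive-definite test case (not from the paper).
H = np.array([[2.0, 1.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0])
omega = 1.0 / np.linalg.norm(H)      # reciprocal of the Frobenius norm

x_iter, history = iterate_solve(H, y, omega, n_iter=100)
x_direct = np.linalg.solve(H, y)
print(x_iter, x_direct)
```

For a symmetric positive-definite H, choosing omega below 2 divided by the largest eigenvalue keeps every error mode contracting; the reciprocal of the Frobenius norm always satisfies this, at the cost of slower convergence for ill-conditioned matrices.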


FIGURE  2     Schematic  diagram  of  the  iterative  optical  matrix-vector  processor  [10]  . 


The basic matrix-vector multiplier architecture in Figure 2 was most recently described in [7]; the incorporation of this system with electronic feedback to produce an iterative optical processor was noted in [8,9]. In the first version of such an iterative optical matrix-vector processor [8], we used I - H as the matrix. We later modified the iterative algorithm to include an acceleration parameter $\omega$ as in (8) to ensure convergence and to speed convergence of the iterative algorithm. The laboratory version of this system was recently described [10]. It uses a linear array of LEDs at P1, fiber optic interconnections between P1 and P2 and a fixed film mask at P2 (a system using a CCD-addressed liquid crystal light valve as the real-time P2 mask was recently noted also by us). The height of the P2 mask is 5 mm and this matches the height of a photodetector element at P3, thus enabling us to butt P2 against P3 in the actual fabrication of the system. This results in a very small and compact processor (four cubic inches) with more effective computing capacity (10^ multiplications per second) than most digital processors provide. With more LEDs, detector elements, and a larger mask, this system can achieve over 10l6 multiplications per second. In the laboratory system we have fabricated [10], a microprocessor controller, hard-wired multiplier and a high-speed ALU are provided in the electronic feedback system. Control of the full system from a front panel is also provided in a well-engineered laboratory system. Extensive memory and display facilities are also provided to enable data readout and analysis of the performance of the system.

In [8], we presented our first experimental APAR processing results obtained on our initial laboratory IOP system. A key issue in the application of this non-coherent processor to the APAR problem is how the complex-valued matrix and vector elements necessary for computing the adaptive weights can be computed on a system whose inputs, outputs and matrix element transmittances are real and positive. In [8], we used spatial multiplexing in which each complex element was described by its positive projections on the 0°, 120° and 240° axes in complex space (as first suggested in [11]). Our results in [8] showed excellent agreement between theory and experiment. In our new approach to handling complex-valued data on this system, we operate the system twice, once with positive-valued inputs and on a second cycle with negative-valued data (with the same biased mask). We arrange the input vector and matrix in terms of the real and imaginary parts of the data as

$\begin{bmatrix} \mathbf{s}^*_{re} \\ \mathbf{s}^*_{im} \end{bmatrix} = \begin{bmatrix} \mathbf{M}_{re} & -\mathbf{M}_{im} \\ \mathbf{M}_{im} & \mathbf{M}_{re} \end{bmatrix} \begin{bmatrix} \mathbf{w}_{re} \\ \mathbf{w}_{im} \end{bmatrix}$, (11)


where each element in (11) is real and bipolar. The operation of this system by subtraction of two successive bipolar outputs provides cancellation of detector fixed-pattern noise [10] and thus greatly improves the system's accuracy and performance while using only twice the space bandwidth product rather than three times the space bandwidth product as required in our original system in [8]. This also reduces the system's throughput by a factor of two, while providing the capacity to handle larger matrices and vectors. Because of the potential throughput of the IOP and its limited space bandwidth product (~ 10^ in 1-D and 10^ in 2-D), such a tradeoff of time for space appears merited.
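The real-valued arrangement of (11) can be checked numerically. The sketch below is illustrative only: the test matrix is random, and the function names `real_embedding` and `unpack_complex` are hypothetical. It builds the doubled real system from a complex M and right-hand side s*, solves it, and confirms that the repacked solution satisfies the original complex equation.

```python
import numpy as np

def real_embedding(M, s_conj):
    """Return the real block matrix and stacked right-hand side of (11)."""
    H = np.block([[M.real, -M.imag],
                  [M.imag,  M.real]])
    y = np.concatenate([s_conj.real, s_conj.imag])
    return H, y

def unpack_complex(x):
    """Rebuild complex weights from x = [w_re; w_im]."""
    n = len(x) // 2
    return x[:n] + 1j * x[n:]

# Random Hermitian positive-definite test matrix (an assumption for illustration).
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
M = A.conj().T @ A + np.eye(4)
s_conj = np.exp(-1j * np.pi * np.arange(4) * np.sin(np.deg2rad(45.0)))

H, y = real_embedding(M, s_conj)
w = unpack_complex(np.linalg.solve(H, y))
print(np.abs(M @ w - s_conj).max())   # residual of the complex equation
```

The doubled system has twice the dimension of the complex one, which is the factor-of-two space bandwidth cost mentioned above.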

In [12], we quantified the system's errors. In [13], we described our models of these errors and their effects on the system's performance. We showed that all fixed spatial errors could be lumped into one unifying single fixed system error that can be modeled as a multiplicative mask error at plane P2. These errors can be reduced by incorporating the necessary fixed correction factors when the P2 mask is recorded. We also modeled temporal system errors (for example detector noise) as an additive and non-correctable error source. The observed output $\hat{\mathbf{y}}$ thus differs from the exact output y with no errors as

$\hat{\mathbf{y}} = \mathbf{y} + [\text{spatial errors}] + [\text{temporal errors}]$. (12)

The last term in (12) is clearly the limitation to the system's performance. For our laboratory system, we have reduced spatial errors to an RMS of ±0.8% and detector temporal errors to ±0.4%.


4.      APAR  PROCESSING  ON  THE  IOP 


To quantify the performance and accuracy of the IOP for APAR processing and to demonstrate the use of the IOP in multi-dimensional APAR processing (with adaptivity in both time and space), we consider a phased array with N = 2 adaptive elements in space (or angle) with two time taps per element, i.e. four adaptive weights. To handle this on our new two-cycle bipolar IOP processor, we require 2 x 4 or 8 input LED elements, 8 output detector elements, and an 8 x 8 element matrix. This example is chosen because its dimensions can be accommodated on our presently available laboratory IOP system (10 x 10 element mask, 10 LEDs and 10 detector elements). We lexicographically order the four received signals as

$z_0(t) = z_{00}(t), \quad z_1(t) = z_{01}(t), \quad z_2(t) = z_{10}(t), \quad z_3(t) = z_{11}(t)$. (13)

We form the new covariance matrix M in terms of z and the corresponding steering vector s. For the example considered, N = 2, N' = 2, the signal source is located at $\theta_0$ = 45° with a velocity $v_0 = 0.5\,v_{max}$ (where $v_{max}$ = 500) and has a power $P_0$ = 0.1, one interference source was present at $\theta_1$ = 0° with velocity $v_1$ = 0 and power $P_1$ = 1.0 (i.e. an SNR_i = 0.1) and the additive receiver noise was of power $N_r$ = 1.0.

The covariance matrix for this case study was

$\mathbf{M} = \begin{bmatrix} 2 & 1 & 1 & 1 \\ 1 & 2 & 1 & 1 \\ 1 & 1 & 2 & 1 \\ 1 & 1 & 1 & 2 \end{bmatrix}$. (14)

For this case M is real. It is arranged at P2 of Figure 2 as the 8 x 8 element matrix

$\begin{bmatrix} \mathbf{M}_{re} & \mathbf{0} \\ \mathbf{0} & \mathbf{M}_{re} \end{bmatrix}$, (15)

where $\mathbf{M}_{re}$ is described by (14). The Euclidean norm of the matrix in (15) is 7.48 and thus its reciprocal is used as the acceleration parameter $\omega$ = 0.13 in (8). The actual optical mask used at P2 of Figure 2 was the matrix in (15) with all elements scaled by $(h_{max} - h_{min})$ = 2 and biased by $h_{min}/(h_{max} - h_{min})$ = 0. The resultant matrix is thus (15) with each element divided by two. The complex-valued steering vector corresponding to the direction $\theta_0$ = 45° and velocity $v_0$ = 0.5, when arranged in our lexicographic format, is

$\mathbf{s}^* = [\,0.82 - 0.61j,\; 0.97 - 0.27j,\; 0.61 - 0.82j,\; 0.27 + 0.97j\,]^T$. (16)

The exogenous steering vector corresponding to (16) has the eight lexicographically-ordered elements

$[\,\mathbf{s}^{*T}_{re},\; \mathbf{s}^{*T}_{im}\,]^T = [\,0.82,\; 0.97,\; 0.61,\; 0.27,\; -0.61,\; -0.27,\; -0.82,\; 0.97\,]^T$. (17)
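The numbers quoted in this example can be cross-checked with a short script. This is a hedged sketch: it assumes that the covariance matrix of (14) is the 4 x 4 matrix with 2 on the diagonal and 1 elsewhere (unit receiver noise plus one unit-power interferer at zero angle and zero velocity), and that the eight elements of (17) read as listed in the array `y` below. Both are readings of the printed values, and the computed solution is an idealized, error-free result that is not meant to reproduce the measured laboratory outputs.

```python
import numpy as np

# Assumed reading of (14): identity (receiver noise) plus all-ones (interferer).
M = np.eye(4) + np.ones((4, 4))

# 8 x 8 arrangement of (15) with a zero imaginary block.
Z = np.zeros((4, 4))
H = np.block([[M, Z],
              [Z, M]])

norm_H = np.linalg.norm(H)          # Frobenius (Euclidean) norm
omega = 1.0 / norm_H                # acceleration parameter
mask = H / (H.max() - H.min())      # scaling by (h_max - h_min); bias is zero

# Assumed reading of the eight elements in (17): real parts, then imaginary parts.
y = np.array([0.82, 0.97, 0.61, 0.27, -0.61, -0.27, -0.82, 0.97])

x = np.zeros(8)
for _ in range(50):                 # 50 iterations, the quoted steady state
    x = x + omega * (y - H @ x)

x_exact = np.linalg.solve(H, y)
print(round(norm_H, 2), round(omega, 2), np.abs(x - x_exact).max())
```

Under these assumptions the norm and acceleration parameter agree with the quoted 7.48 and 0.13, and 50 iterations bring the iterate to within about 0.001 of the direct solution.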

The system produced eight outputs ($x_1, \ldots, x_8$). These are the real and imaginary parts of the weights,

$\mathbf{x} = [\,\mathbf{w}^T_{re},\; \mathbf{w}^T_{im}\,]^T$, (18)

and are related to the complex-valued weights by

$w_{00} = x_1 + jx_5, \quad w_{01} = x_2 + jx_6, \quad w_{10} = x_3 + jx_7, \quad w_{11} = x_4 + jx_8$. (19)

In Figure 3, we show the system's eight outputs at iterations j = 1, 5 and 50 (steady-state), denoted by x(1), x(5) and x(50) respectively. From Figure 3, we find the steady-state weights to be

$\mathbf{w}(50) = [\,0.9 - 0.45j,\; 0.75 - 0.25j,\; 0.4 - 0.8j,\; 0.2 + 1.1j\,]^T$. (20)

The resultant antenna pattern obtained when the weights in (20) were applied to the antenna with the interference source and receiver noise indicated is shown in Figure 4. This pattern contains a peak at the desired location as well as nulls at the desired target velocities and angles.


FIGURE 3 Experimental outputs from the relevant eight photodetectors of the IOP of Figure 2 in the computation of the complex-valued weights for a multi-dimensional antenna with space and time adaptivity [13].


FIGURE 4 Adaptive antenna pattern, plotted against target angle, obtained from the weights computed from the IOP laboratory system [13].


5.      ERROR  SOURCE  ANALYSIS 


To determine the accuracy of these weights, we calculated the RMS error between the exact weights and those in (20). This error was 2.3%. The true measure of the accuracy of such a system is its effect on the SNR of the adaptive antenna pattern. We define this accuracy as a function of the iteration number j as

$\mathrm{SNR}(j) = \dfrac{P_0\,|E(\theta_0, j)|^2}{\sum_{m=1}^{M} P_m\,|E(\theta_m, j)|^2}$. (21)

The SNR obtained from the exact weights was 14.96 dB, whereas the SNR obtained from the antenna pattern resulting from application of the weights computed on the IOP system as in (20) was 14.7 dB (obtained from Figure 4). Thus, only a negligible 0.26 dB difference results between the antenna pattern obtained from the exact weights and those computed on our laboratory IOP. We produced a theoretical model of the IOP as in (12) and from this model we developed the theoretical system performance shown in Figure 5 for the case of no errors (top curve) and for a system with 2.5% residual mask errors ($b_{mn}$) and 0.5% detector noise errors ($t_m$). Our experimental data and simulations obtained for over ten different case studies satisfied these theoretical trends in Figure 5 as well as the general upper-bound error model we developed [13].


6.  SUMMARY/CONCLUSIONS 

In this paper, we have described a new and general-purpose optical matrix-vector processor with emphasis on its use in APAR processing. We have experimentally demonstrated the use of this laboratory IOP system for multi-dimensional phased array radar processing. The SNR of the results obtained indicates that the system can achieve excellent performance and accuracy and that the performance obtained is within the error bounds predicted by our theoretical analysis.
FIGURE 5 Output antenna pattern SNR(j) as a function of the number of iterations j for no IOP system errors (Δb = Δt = 0) and for typical experimental IOP errors (Δb = 0.025, Δt = 0.005).
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APPENDIX  A    (OPTIMAL  WEIGHTS) 

Various performance measures exist for an APAR. In this appendix, we first consider the adaptive weights w that minimize the mean-square error and then the weights that maximize the output SNR. We show that optimizing either of these criteria results in the same set of equations.

We first consider the mean square error measure between the signal $s(t)\exp(j\omega t)$ and the array output $v_{out}(t)$ in (2). This mean square error is

$\epsilon_{ms} = \overline{|s(t)e^{j\omega t} - v_{out}(t)|^2} = \overline{|s(t)e^{j\omega t} - \mathbf{w}^T\mathbf{z}(t)|^2}$

$\quad = \overline{s^*(t)s(t)} - \overline{s^*(t)e^{-j\omega t}\mathbf{z}^T(t)}\,\mathbf{w} - \overline{s(t)e^{j\omega t}(\mathbf{w}^*)^T\mathbf{z}^*(t)} + \overline{(\mathbf{w}^*)^T\mathbf{z}^*(t)\mathbf{z}^T(t)\mathbf{w}}$, (A1)

where (·) with a bar over everything denotes the expectation operator. The weights $\mathbf{w} = \{w_n\}$ that produce the minimum mean square estimate of the signal are obtained by setting the partial derivatives of (A1) with respect to the $w_n$ equal to zero, i.e.

$\nabla\epsilon_{ms} = (\mathbf{w}^*)^T\,\overline{\mathbf{z}^*(t)\mathbf{z}^T(t)} - \overline{s^*(t)e^{-j\omega t}\mathbf{z}^T(t)} = 0$. (A2)

The optimal weights thus solve

$\mathbf{M}\mathbf{w} = P_0\,\mathbf{s}^*$. (A3)

Next, we consider detection of a signal in the presence of noise. In this case, the output SNR is the parameter to be optimized for the adaptive phased array. We optimize this parameter by proper selection of w. The SNR at the antenna's output is

$\mathrm{SNR} = \dfrac{P_0\,(\mathbf{w}^*)^T\mathbf{s}^*\mathbf{s}^T\mathbf{w}}{(\mathbf{w}^*)^T\mathbf{M}\mathbf{w}}$. (A4)

The weights that maximize (A4) are obtained by setting the partial derivative of (A4) with respect to the $w_n$ equal to zero, i.e.

$\nabla\mathrm{SNR} = P_0\,\dfrac{(\mathbf{w}^*)^T\mathbf{M}\mathbf{w}\,(\mathbf{w}^*)^T\mathbf{s}^*\mathbf{s}^T - (\mathbf{w}^*)^T\mathbf{s}^*\mathbf{s}^T\mathbf{w}\,(\mathbf{w}^*)^T\mathbf{M}}{[(\mathbf{w}^*)^T\mathbf{M}\mathbf{w}]^2} = 0$. (A5)

It is easily shown that the optimal weights that maximize (A4) satisfy

$\mathbf{s}^* = \dfrac{(\mathbf{w}^*)^T\mathbf{s}^*}{(\mathbf{w}^*)^T\mathbf{M}\mathbf{w}}\,\mathbf{M}\mathbf{w} = \mu\,\mathbf{M}\mathbf{w}$. (A6)

The scale factor $\mu$ in (A6) does not affect SNR. Thus, the same linear algebraic equation in (3a) satisfies both performance criteria. The maximum SNR possible is $\mathbf{s}^T\mathbf{M}^{-1}\mathbf{s}^*$ and is independent of $\mu$.
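The three claims above (the solution of (3a) maximizes the output SNR, the maximum equals $\mathbf{s}^T\mathbf{M}^{-1}\mathbf{s}^*$ for unit signal power, and a complex scale factor leaves the SNR unchanged) can be spot-checked numerically. The sketch below uses a randomly generated Hermitian positive-definite matrix as a stand-in covariance; that matrix, the perturbation size and the name `output_snr` are illustrative assumptions.

```python
import numpy as np

def output_snr(w, M, s, p0=1.0):
    """SNR of (A4): p0 * |s^T w|^2 / ((w*)^T M w)."""
    signal = p0 * abs(s @ w) ** 2
    noise = np.real(w.conj() @ M @ w)
    return signal / noise

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
M = A.conj().T @ A + np.eye(n)                      # stand-in covariance matrix
s = np.exp(1j * np.pi * np.arange(n) * np.sin(np.deg2rad(45.0)))

w_opt = np.linalg.solve(M, s.conj())                # M w = s*
snr_opt = output_snr(w_opt, M, s)
snr_bound = np.real(s @ np.linalg.solve(M, s.conj()))   # s^T M^-1 s*
snr_scaled = output_snr((2.0 - 3.0j) * w_opt, M, s)     # arbitrary scale factor

# Randomly perturbed weights should never beat the optimum.
snr_trials = [
    output_snr(w_opt + 0.3 * (rng.normal(size=n) + 1j * rng.normal(size=n)), M, s)
    for _ in range(100)
]
print(snr_opt, snr_bound, snr_scaled, max(snr_trials))
```

A random search of this kind is only a sanity check; the underlying guarantee is the Cauchy-Schwarz inequality applied to the ratio in (A4).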

APPENDIX  B    (SYSTOLIC  ARRAY  REALIZATION) 


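We now describe an optical version of a systolic array processor that is capable of solving matrix-vector equations using only 1-D AO transducers rather than 2-D spatial light modulators as in Figure 2. We consider the simple case of performing the banded matrix-vector product

$\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix} = \begin{bmatrix} b_{11} & & & & \\ b_{21} & b_{22} & & & \\ b_{31} & b_{32} & b_{33} & & \\ & b_{42} & b_{43} & b_{44} & \\ & & \ddots & \ddots & \ddots \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{bmatrix}$. (B1)

The proposed optical systolic array processor of Figure B1 achieves this. In this system, the LED inputs are the diagonal elements of the matrix in (B1), i.e.

$\begin{array}{ll} \mathrm{LED}_3(t): & \ldots \;\; b_{42} \;\; 0 \;\; b_{31} \;\; 0 \\ \mathrm{LED}_2(t): & \ldots \;\; b_{43} \;\; 0 \;\; b_{32} \;\; 0 \;\; b_{21} \\ \mathrm{LED}_1(t): & \ldots \;\; 0 \;\; b_{33} \;\; 0 \;\; b_{22} \;\; 0 \;\; b_{11} \end{array}$ (B2)

and the input to the AO cell is the vector in (B1)

$\mathrm{AO}_{in}(t): \;\; \ldots \;\; 0 \;\; a_2 \;\; 0 \;\; a_1 \;\; 0$. (B3)

When the outputs at the detector in Figure B1 are summed and shifted by one element per input element cycle, the desired matrix-vector product results at the output. When this result is fed back to the AO cell in Figure B1, an iterative matrix-vector systolic array processor results.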
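The data flow of the banded product can be sketched in software. This is an abstraction, not a model of the optical timing: each list in `diagonals` plays the role of one LED stream (the main diagonal and the sub-diagonals of the banded matrix), the vector elements enter one per cycle, and each partial product is accumulated into the output element it belongs to. The function name and the numerical values are hypothetical.

```python
import numpy as np

def banded_matvec_by_diagonals(diagonals, a):
    """Compute d = B a where diagonals[k][i] holds b[i + k, i] (k-th sub-diagonal)."""
    n = len(a)
    d = np.zeros(n)
    for t in range(n):                          # cycle t: element a[t] enters the cell
        for k, diag in enumerate(diagonals):    # one input stream per diagonal
            row = t + k                         # output element fed by this product
            if row < n:
                d[row] += diag[t] * a[t]        # sum into the shifted accumulator
    return d

# Illustrative 5 x 5 lower-banded matrix with three diagonals, as in (B1).
main = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # b11, b22, ...
sub1 = np.array([0.5, 0.6, 0.7, 0.8])          # b21, b32, ...
sub2 = np.array([0.1, 0.2, 0.3])               # b31, b42, ...
a = np.array([1.0, -1.0, 2.0, 0.5, 3.0])

d = banded_matvec_by_diagonals([main, sub1, sub2], a)
B = np.diag(main) + np.diag(sub1, -1) + np.diag(sub2, -2)
print(d, B @ a)
```

Only the non-zero diagonals are ever stored or streamed, so the work per cycle depends on the bandwidth rather than on the full matrix dimension.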
FIGURE B1 Schematic diagram of an optical systolic array iterative matrix-vector processor [6].
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Introduction 

A  2D  Magneto-optic  device  with  high  speed  non-volatile  random  access  capability  is  described  in  this  paper.    Drive  requirements 
and  structure  are  compatible  with  LSI  technology.    Pixel  switching  is  electromagnetic,  non-thermal,  random  access  addressed  by 
current  pulses  in  crossed  conductors  deposited  on  the  garnet.    The  perfection  of  the  solid  state  crystal  structure  provides  the 
potential  for  very  high  quality,  high  resolution,  optical  characteristics.    The  device  has  been  named  LIGHT-MOD.™    This  stands 
for  Litton  Iron  Garnet  H  Triggered  Magneto  Optic  Device. 

The  Magneto  Optic  Device 

The  LIGHT-MOD  consists  of  bismuth  substituted  iron  garnet  films  grown  on  nonmagnetic  substrates  with  the  uniaxial 
anisotropy  oriented  perpendicular  to  the  plane  of  the  film  and  of  magnitude  greater  than  the  saturation  magnetization  of  the  film. 
The  film  is  structured  into  isolated  mesas  as  can  be  seen  in  the  scanning  electron  microscope  picture  shown  in  Figure  1.    The  drive 
conductors  are  deposited  and  structured  using  conventional  semiconductor  metals,  dielectrics,  and  photolithography.    A  picture  of 
one  of  the  48  x  48  test  arrays  mounted  on  a  printed  circuit  board  is  shown  in  Figure  2. 


Figure  1.    SEM  structured  film  Figure  2.    48  x  48  LIGHT-MOD 


The use of the LIGHT-MOD as a light valve is depicted in Figure 3. Vertically polarized light exits from the polarizer. As the light passes through the film the polarization direction is rotated clockwise or counter-clockwise dependent upon the sense of magnetization of the film. The amount of rotation depends upon the Faraday constant of the material, θF degrees per micron (°/µm), and upon the thickness of the film. As depicted, the analyzer has been set such that it blocks the light for the cells magnetized with the North Pole into the film and transmits light for the cells with the opposite direction of magnetization.

To change the direction of magnetization of a mesa, current is passed through the two adjoining conductors intersecting at the mesa. The combined magnetic fields will switch the state of that mesa only. The magnetic field generated by current flowing in a single conductor is insufficient to change the state of a mesa. The LIGHT-MOD switching takes place in two steps as depicted in Figure 4. The switching threshold or nucleation is established by the anisotropy field (Hk) minus the demagnetizing field (4πMs) and occurs as coherent rotation or flux reversal. A domain wall is formed and then propagates via domain wall motion. When the wall reaches the bottom of the film the cell has been nucleated. Removal of the drive current at that time results in the pixel being demagnetized or stripped out. By maintaining drive currents until the wall has propagated to the opposite corner, and assuming the magnetic field exceeds the saturation field (Hs) at that distance from the drive lines, the mesa will complete switching and will then be saturated in the opposite magnetization. It will remain in that state indefinitely until it is nucleated in the opposite direction by drive line currents.
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Figure  3.    Operation  of  magneto-optic  pixels 
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Figure  4.   Post  switching  process  [cross  section  of  posts] 

The LIGHT-MOD achieves pixel stability by meeting the condition Hk - 4πMs > Hsat, as described by Pulliam et al. (Ref. 1). As is shown in Figure 5, if Hk - 4πMs > Hsat the pixel states can only be the saturated states, unless drive currents are terminated prior to completion of switching as described above.


Figure  5.    Magnetic  stability 
The pixel switching is quite fast. Wall velocities as high as 900 meters per second have been observed in these epitaxial garnets under high drive fields and with field components normal to the wall (Ref. 2). Both conditions are achieved in this device. Switching times of about one microsecond are observed for 100 micron square pixels. Average wall velocities are, thus, approximately 100 m/sec. Speed of operation is a function of cell size and mode of operation. The smaller the cell size the faster the cells switch, assuming the same wall velocity. Therefore, for a given array area the total switching time can be approximately constant whether it is structured into a 128 x 128 array or a 512 x 512 array. Since the image is non-volatile, refresh is not required and only those pixels which are to be changed need to be switched. This results in very low power consumption and effective bandwidth compression during image transmission.
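The arithmetic behind these statements can be written out as a short sketch. It rests on two assumptions that are not stated explicitly in the text: the switching time of a pixel is its side length divided by an average wall velocity, and the frame is written one full line at a time, so that the total time is the number of lines multiplied by the per-pixel switching time. The function names and the 12.8 mm array side are illustrative choices.

```python
import math

def pixel_switch_time(pixel_size_m, wall_velocity_m_s):
    """Time for a domain wall to cross one pixel at a given average velocity."""
    return pixel_size_m / wall_velocity_m_s

def line_parallel_frame_time(array_side_m, n_lines, wall_velocity_m_s):
    """Total time to write a square array one line at a time (assumed mode)."""
    pixel_size = array_side_m / n_lines
    return n_lines * pixel_switch_time(pixel_size, wall_velocity_m_s)

velocity = 100.0                                   # m/s, average wall velocity
t_pixel = pixel_switch_time(100e-6, velocity)      # 100 micron pixel

side = 12.8e-3                                     # hypothetical array side, 12.8 mm
t_128 = line_parallel_frame_time(side, 128, velocity)
t_512 = line_parallel_frame_time(side, 512, velocity)
print(t_pixel, t_128, t_512)
```

Under these assumptions a 100 micron pixel switches in one microsecond, and the line-at-a-time frame time depends only on the array side and the wall velocity, not on how finely the area is divided.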

The LIGHT-MOD can operate at a much higher speed than that of the devices utilizing the temperature compensation writing technique reported by Krumme et al. (Ref. 3) and Hill and Schmidt (Ref. 4). Hill and Schmidt used bismuth doped garnets showing a compensation temperature near room temperature. Switching was achieved by elevating the temperature of the addressed pixel while applying a reversal field. The lack of magnetism at the compensation temperature assures pixel stability in their device after switching.

The  differences  between  their  device  and  the  LIGHT-MOD  are 

1.  Hill  and  Schmidt  use  Temperature  triggering  rather  than  the  H  Field  (magnetic  field)  triggering  used  by  the 
LIGHT-MOD. 

2.  The  Hill  and  Schmidt  device  requires  operating  at  the  compensation  temperature  whereas  the  LIGHT-MOD  can  be 
operated  over  a  wide  temperature  range  up  to  the  Curie  temperature. 

The  processing  steps  for  producing  the  LIGHT-MOD  are  relatively  simple  in  comparison  to  semiconductor  or  bubble  memory 
devices,  and  require  only  relatively  simple  photolithographic  equipment  which  has  high  throughput  capability.    Arrays  having  pixels 
from  10  micron  size  to  400  microns  have  been  evaluated.    Pixel  size  and  shape  is  not  critical.    Maximum  pixel  size  which  can  be 
saturated  with  the  drive  lines  is  in  the  250  to  400  micron  range.    External  field  coils  can  be  utilized  to  advantage  to  complete 
switching  for  some  applications. 

The  optical  operation  of  the  LIGHT-MOD  is  influenced  by  absorption  and  Faraday  rotation.    Figure  6  shows  how  the  overall 
transmission  is  influenced  by  differences  in  Faraday  rotation  for  a  given  absorption  coefficient.    It  can  be  seen  that  the  light 
transmission  efficiency  is  significantly  increased  with  increase  in  Faraday  rotation.    Work  is  therefore  in  progress  to  increase  the 
Faraday  rotation  of  the  film. 


Figure 6. Light transmission vs. film thickness

Optical processing applications

The  LIGHT-MOD  can  be  used  in  optical  signal  processing  systems  in  several  capacities.    The  device  can  be  used  as  a  random 
access  optical  scanner  by  allowing  one  (or  several)  pixels  to  be  transmissive  at  a  time.    Therefore  only  a  selected  portion  of  the 
incident  field  is  allowed  to  propagate  through  the  plane  of  the  magneto-optic  device.    If  the  LIGHT  MOD  is  interfaced  with  a  high 
quality  photodetector  such  as  a  photomultiplier,  a  2-D  random  access  image  sensor  could  be  constructed  that  combines  in  principle 
the  excellent  detector  properties  of  the  photo-multiplier  and  the  resolution  and  speed  of  the  magneto-optic  device.    Other  possible 
uses  of  the  device  include  logic  operations  on  2-D  binary  data  fields  and  as  a  programmable  scanning  spatial  filter.    The  most 
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obvious  use  of  the  LIGHT-MOD  is  a  spatial  light  modulator.   The  data  to  be  processed  by  the  optical  computer  is  written  on  the 
device  through  the  electrode  structure.    If  a  raster  scan  format  is  used,  each  pixel  can  be  addressed  sequentially  by  activating  the 
appropriate  crossed  electrodes  or  each  line  can  be  written  in  parallel  by  activating  one  of  the  horizontal  electrodes  and  simultaneously 
applying the signal corresponding to a raster line to the vertical electrodes. Each pixel or line can be switched in approximately
1 µsec, which allows a 512 x 512 frame to be written in about 0.26 seconds serially or 0.512 milliseconds in parallel. The parallel
addressing scheme is attractive because of its speed, but the driving electronics are more complex in such a configuration. The
scanning format is not restricted to the conventional raster; any desired addressing sequence can be implemented with digital control.
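The frame write times follow directly from the switching time and the number of addressing operations. The short Python sketch below works this out; the function name and its arguments are illustrative only and are not part of any LIGHT-MOD driver software.

```python
def frame_write_time(rows, cols, switch_time_s, line_parallel=False):
    """Time to write one frame.

    Serial addressing needs one switching operation per pixel;
    line-parallel addressing needs one operation per raster line.
    """
    operations = rows if line_parallel else rows * cols
    return operations * switch_time_s


# Values taken from the text: 512 x 512 array, about 1 microsecond per switch.
serial_s = frame_write_time(512, 512, 1e-6)
parallel_s = frame_write_time(512, 512, 1e-6, line_parallel=True)

print(f"serial write:   {serial_s:.3f} s")
print(f"parallel write: {parallel_s * 1e3:.3f} ms")
```

Serial addressing takes roughly a quarter of a second per frame, while line-parallel addressing is faster by a factor equal to the number of columns.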
The size of the arrays is projected to be 512 x 512 for the next generation of the devices, with each square pixel having an active
area approximately (50 - 100 µm) and separation of (6 - 12 µm). Garnet films can be grown with good optical quality (perfect
crystals); therefore magneto-optic devices can have low scatter level and optical flatness, which will allow them to be used in
coherent optical systems. The electrode structure will have to be masked off to minimize scatter as well as the background light
level. Another important consideration is the efficiency of the magneto-optic device, measured as the percentage of the incident
light that is transmitted as spatially modulated light. There are three primary factors that determine the efficiency. The first is the active post
area $A_1$ relative to the light blocking electrode area $A_2$, which results in an attenuation of the incident beam by $A_1/(A_1 + A_2)$. Since the
minimum feature size that can be fabricated at a given film thickness is fixed, $A_1 = A_2$ for fine resolution devices and approximately
half the light is blocked by the electrodes. The second factor is the optical absorption by the garnet films. The attenuation
constant of the material that is presently used is 2000/cm. With a thickness of 5 µm the resulting transmission is
$e^{-(5 \times 10^{-4})(2000)} = e^{-1} \approx 0.37$. Finally the analyzer following the LIGHT-MOD will absorb a portion of the light. The
contrast of the light output is defined by the following equation:

$$\frac{T_{\max} - T_{\min}}{T_{\max} + T_{\min}} = \frac{\cos^2(\phi - \theta) - \cos^2(\phi + \theta)}{\cos^2(\phi - \theta) + \cos^2(\phi + \theta)}$$

where $\phi$ is the angle of the analyzer with respect to the polarization angle of the incident light and $\pm\theta$ is the angle with respect to
the incident light polarization by which the polarization of the light is rotated when it propagates through the magnetized film. In
most applications it is desirable to maximize the contrast. This is achieved by setting $\phi + \theta = 90^\circ$, which results in a contrast ratio
equal to one (neglecting noise). The intensity of the output light is then attenuated by $\cos^2(\phi - \theta) = \cos^2(90^\circ - 2\theta)$.
Obviously the attenuation is minimized if $\theta = 45^\circ$. At the present time, however, $\theta$ is approximately $10^\circ$. The rotary power
of the magnetized film is proportional to its thickness. Therefore it would appear that one might improve the contrast by growing
thicker films. The optical attenuation, however, grows exponentially with the thickness, and therefore for a given material there is an
optimum thickness at which the films are grown. There are other factors that also affect the efficiency, such as polarizer
coefficients of about 0.8 to 0.9, and reflections at the surface, which could be eliminated with anti-reflection coatings. The
efficiency of present devices is calculated to be approximately $e^{-(5 \times 10^{-4})(2000)} \cos^2 70^\circ \approx 0.04$. Indications are that the
research that is presently being conducted at Litton will result in improvements in the material properties which will result in
significantly more efficient devices.
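The efficiency estimate and the thickness trade-off can be checked numerically. The Python sketch below is an illustration under stated assumptions, not the authors' calculation: it treats the analyzer as an ideal Malus's-law element set for unity contrast (analyzer angle plus rotation equal to 90 degrees), assumes the quoted 10 degree rotation belongs to the quoted 5 µm film so that rotation scales at 2 degrees per µm, and ignores the electrode area factor, polarizer coefficients, and surface reflections. All function names are hypothetical.

```python
import math

ALPHA_PER_CM = 2000.0        # attenuation constant quoted in the text
PRESENT_THICKNESS_UM = 5.0   # film thickness quoted in the text
PRESENT_THETA_DEG = 10.0     # present Faraday rotation quoted in the text


def transmission(thickness_um, alpha_per_cm=ALPHA_PER_CM):
    """Fraction of light surviving absorption in the film (1 um = 1e-4 cm)."""
    return math.exp(-alpha_per_cm * thickness_um * 1e-4)


def contrast(phi_deg, theta_deg):
    """(Tmax - Tmin) / (Tmax + Tmin) for analyzer angle phi and rotation +/- theta."""
    t_max = math.cos(math.radians(phi_deg - theta_deg)) ** 2
    t_min = math.cos(math.radians(phi_deg + theta_deg)) ** 2
    return (t_max - t_min) / (t_max + t_min)


def efficiency(thickness_um,
               rotation_deg_per_um=PRESENT_THETA_DEG / PRESENT_THICKNESS_UM):
    """Absorption loss times analyzer loss, with the analyzer set for unity contrast."""
    theta_deg = rotation_deg_per_um * thickness_um   # assumed linear in thickness
    analyzer_factor = math.cos(math.radians(90.0 - 2.0 * theta_deg)) ** 2
    return transmission(thickness_um) * analyzer_factor


def best_thickness_um(step_um=0.01, max_um=22.5):
    """Grid search for the thickness that maximizes efficiency (theta <= 45 degrees)."""
    candidates = [step_um * k for k in range(1, int(round(max_um / step_um)) + 1)]
    return max(candidates, key=efficiency)


print(f"transmission at 5 um: {transmission(5.0):.2f}")
print(f"contrast at phi = 80, theta = 10 degrees: {contrast(80.0, 10.0):.3f}")
print(f"efficiency at 5 um: {efficiency(5.0):.3f}")
print(f"best thickness under these assumptions: {best_thickness_um():.2f} um")
```

Under these simplified assumptions the optimum falls somewhat above the present 5 µm thickness, which illustrates why an optimum exists; the actual value depends on the real material constants.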

The  present  devices  are  binary  spatial  light  modulators;  the  output  light  intensity  amplitude  at  each  pixel  can  take  only  one  of 
two  possible  states  depending  on  which  direction  the  pixel  is  magnetized.    Therefore  these  devices  can  only  be  used  in  applications 
where  the  input  data  to  the  optical  processor  is  binary.    Text  processing  and  certain  robotics  applications  are  important  examples 
where  the  input  scenes  are  indeed  binary.    In  order  to  extend  the  applicability  of  the  LIGHT-MOD  to  a  broader  class  of  problems 
however,  it  is  important  to  be  able  to  represent  many  gray  scales.    In  the  following  section  we  will  explore  several  methods  for 
doing  that. 


Gray  scale  considerations 


The basic philosophy we use to represent more than two gray levels is to allocate more than one magnetic mesa to each pixel,
in a trade-off of gray scale for resolution. The value of the data at each pixel is represented by a binary word and each bit of the
word sets the magnetization state of one of the mesas that comprise the pixel. There are several distinct ways that can be used to
combine mesas and we will examine each of them separately. At the end of this section we will demonstrate how these methods
can  be  integrated  into  a  simple  structure  that  can  have  many  gray  scales.    In  the  first  method  we  will  consider  several  identical 
devices  which  are  placed  in  parallel  as  shown  in  Figure  7.    The  state  of  the  magnetization  of  each  pixel  is  set  by  the  binary  word 
of  the  corresponding  digitized  data  point. 
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Figure  7.    Parallel  devices 
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Each  device  is  magnetized  by  a  different  binary  bit.    The  state  of  the  pixels  of  device  3  in  Figure  7  is  set  by  the  least 
significant  bit,  and  the  most  significant  bit  is  applied  to  device  1.   The  three  LIGHT-MOD's  are  uniformly  illuminated  with  polarized 
light,  and  an  analyzer  is  placed  after  each  device.   The  direction  of  the  analyzer  is  such  that  no  light  is  transmitted  through  a  pixel 
that is in the zero state. The intensity of the transmitted light when the pixel is in the binary one state is denoted by $I_0$. The
light  from  the  three  devices  is  brought  into  coincident  focus  using  3  beamsplitters  and  an  imaging  lens  as  shown  in  Figure  7.  The 
optical  distances  between  each  of  the  devices  and  the  lens  are  equal.    The  light  intensity  in  the  image  plane  is  the  sum  of  the 
intensities  of  the  light  contributed  by  each  of  the  devices.   The  beamsplitters  are  assumed  to  be  50  percent  transmissive.  Therefore 
the  light  from  the  3rd  device  (corresponding  to  the  least  significant  bit)  is  attenuated  by  1/8,  whereas  the  light  from  the  2nd  and 
1st  devices  is  attenuated  by  1/4  and  1/2  respectively.   This  arrangement  produces  a  linear  mapping  from  the  3-bit  input  binary  word 
to  the  output  intensity  level.   The  eight  possible  input/output  states  are  tabulated  in  the  table  of  Figure  7.    In  principle,  N  devices 
can be used in parallel to produce $2^N$ gray levels. A relative disadvantage of this technique is the rather complex optical system
and the strict alignment requirements.
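The linear mapping of the parallel arrangement can be verified by enumerating the eight input words. The Python sketch below uses the attenuations stated in the text (1/2, 1/4 and 1/8 for devices 1, 2 and 3) and assumes ideal, lossless 50 percent beamsplitters; the function name is illustrative.

```python
from itertools import product


def parallel_output(bits, i0=1.0):
    """Summed image-plane intensity for bits = (b1, b2, b3), most significant bit first.

    Device k is attenuated by (1/2)**k, following the attenuations quoted in the text.
    """
    return sum(b * i0 * 0.5 ** k for k, b in enumerate(bits, start=1))


levels = {}
for bits in product((0, 1), repeat=3):
    word = int("".join(str(b) for b in bits), 2)   # value of the 3-bit input word
    levels[word] = parallel_output(bits)
    print(bits, word, levels[word])
```

Each output equals the input word divided by eight (in units of the single-device intensity), so the eight gray levels are equally spaced.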

The disadvantage can be overcome by using on the same device several parallel mesas which will collectively represent many
gray levels. In addition, we do not need to attenuate (or illuminate) the different pixels by different amounts, if an area modulation
scheme is used. This is shown in Figure 8. In this instance four mesas comprise one pixel and the relative areas of mesas 1,
2, 3 and 4 are 1, 2, 4 and 8 respectively. The direction of magnetization of each mesa can be independently set; therefore the
pixel consisting of four mesas can be set in any of $2^4 = 16$ states. We assume that the analyzer blocks all of the light transmitted
through a mesa in the binary zero state and we denote the intensity transmitted through mesa 1 when it is in state one, by I.
The output light intensity averaged over one pixel area is listed for all possible input states in the table of Figure 8. We see that
the output intensity can attain $2^4 = 16$ distinct states corresponding to 16 equally separated intensity levels. This monolithic
approach is very attractive; however, the gray scale is achieved at the expense of reduced resolution and addressing speed.
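The area modulation scheme can be enumerated in the same way. In the Python sketch below the light collected from a pixel is expressed in units of the light passed by mesa 1 (the smallest mesa) in its binary one state; gaps between mesas and any nonuniformity are ignored, and the names are illustrative.

```python
from itertools import product

MESA_AREAS = (1, 2, 4, 8)   # relative areas of mesas 1 to 4, from the text


def pixel_output(bits, unit_intensity=1.0):
    """Light collected from one composite pixel; bits[k] is the state of mesa k + 1."""
    return unit_intensity * sum(b * area for b, area in zip(bits, MESA_AREAS))


levels = sorted(pixel_output(bits) for bits in product((0, 1), repeat=4))
steps = {b - a for a, b in zip(levels, levels[1:])}

print("levels:", levels)
print("step sizes:", steps)
```

Because the areas are binary weighted, the 16 mesa states map onto 16 distinct, equally spaced output levels.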
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Figure  8.    Composite  pixels 


The  two  methods  discussed  thus  far  utilize  several  mesas  in  parallel  that  are  combined  to  form  a  single  pixel  at  which  multiple 
gray  levels  can  be  represented.   Alternatively,  a  cascade  of  mesas  may  be  used.   This  configuration  is  depicted  in  Figure  9.  Three 
separate  devices  are  cascaded  by  either  placing  them  one  against  the  other  or  imaging  one  into  the  next.   The  cascade  is  illuminated 
with polarized light and an analyzer is placed after all three devices. Let $\theta$ ($-\theta$) be the angle by which the polarization of the
incident light is rotated when it propagates through a mesa magnetized in the direction corresponding to a binary 1 (0) state. The
angle of the analyzer with respect to the input polarization is denoted by $\phi$. There are $2^3 = 8$ possible states at which the triple
cascade can be set. Four of these states are degenerate however, and only four distinct output intensity levels result (see the table
in Figure 9). The output intensities corresponding to the four distinct states are plotted as a function of $\phi$ for $\theta = 10^\circ$. One
possible mode of operation is to choose $\phi = 120^\circ$ so that the output intensity is zero for the (0,0,0) input state. This choice of
$\phi$ maximizes the contrast, but the four distinct levels are not separated by exactly equal amounts, and the input states are not
mapped linearly into output intensity. $\theta$ and $\phi$ can be chosen properly to ensure linearity, or a desired non-linearity can be
implemented. A limitation of this technique is that relatively few distinct output gray levels result, due to the degeneracies.

Figure 9. Cascaded LIGHT-MOD's with films of equal thicknesses

These degeneracies can be removed by using three devices with garnet films of unequal thicknesses. Each device rotates the polarization by
a different angle, denoted by $\theta_1$, $\theta_2$ and $\theta_3$. The $2^3 = 8$ distinct output states and the corresponding intensities are listed in the
table for the triple cascade shown in Figure 10. The output light intensity is plotted as a function of $\phi$ for all the possible states,
with $\theta_1 = 20^\circ$, $\theta_2 = 10^\circ$ and $\theta_3 = 5^\circ$. One possible operating point is $\phi = -55^\circ$. At that point there are 8 distinct intensity
levels (gray shades). In principle any number N of devices can be cascaded, resulting in $2^N$ gray shades. In practice, however, N is
limited by the compounded light losses when using multiple films.
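The degeneracy of the equal-thickness cascade, and its removal with unequal thicknesses, can be illustrated with a short enumeration. The Python sketch below is idealized: it assumes the rotations of the cascaded films simply add, that a binary one contributes a positive rotation and a binary zero a negative one, that the analyzer follows Malus's law, and that absorption is neglected. The function names are illustrative.

```python
import math
from itertools import product


def cascade_levels(rotations_deg, phi_deg):
    """Map each input state to its output intensity for a cascade of films."""
    levels = {}
    for bits in product((0, 1), repeat=len(rotations_deg)):
        # Net rotation: +rotation for a binary one, -rotation for a binary zero.
        net_deg = sum(r if b else -r for r, b in zip(rotations_deg, bits))
        levels[bits] = math.cos(math.radians(phi_deg - net_deg)) ** 2
    return levels


def distinct(levels, ndigits=9):
    """Sorted list of distinct intensity values (rounded to absorb float noise)."""
    return sorted({round(value, ndigits) for value in levels.values()})


equal = distinct(cascade_levels((10, 10, 10), 120))
unequal = distinct(cascade_levels((20, 10, 5), -55))

print(len(equal), "levels with equal films:", equal)
print(len(unequal), "levels with unequal films:", unequal)
```

With three equal films only the number of ones matters, so eight states collapse to four levels; with rotations of 20, 10 and 5 degrees every state has a different net rotation and all eight levels are distinct.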


Figure  10.   Cascaded  LIGHT-MOD's  with  films  of  unequal  thicknesses 
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In  our  discussion  of  gray  scale  thus  far,  the  mesas  of  the  LIGHT-MOD  have  been  treated  as  bistable.    In  fact,  the 
demagnetized  state  is  also  stable,  and  it  can  be  accessed  by  applying  the  appropriate  signals  to  the  electrodes,  as  previously  discussed. 
Mesas  that  are  magnetized  in  all  three  stable  states  viewed  through  crossed  polarizers  are  shown  in  Figure  11.    In  the  demagnetized 
state,  half  the  area  of  each  mesa  is  magnetized  in  one  direction  and  the  other  half  in  the  opposite  direction.   Therefore,  the 
intensity  of  the  light  transmitted  through  a  pixel  in  the  demagnetized  state  is  the  average  of  the  intensities  of  the  two  magnetized 
states.   The  output  intensity  levels  are  listed  in  the  table  of  Figure  12  for  each  of  the  three  stable  states.   The  intensity  of  the 
three stable states vs. the angle $\phi$ of the analyzer is plotted in Figure 12, for $\theta = 10^\circ$. One convenient operating point is $\phi = 80^\circ$,
where we obtain maximum contrast and three equally separated intensity levels.
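As a check on this operating point, the three intensities can be worked out for an analyzer angle of 80 degrees and a rotation of 10 degrees, assuming an ideal analyzer obeying Malus's law and no absorption. The labels $I_0$, $I_1$ and $I_d$ for the zero, one and demagnetized states are introduced here for illustration only.

```latex
\begin{aligned}
I_0 &= \cos^2(\phi + \theta) = \cos^2 90^\circ = 0, \\
I_1 &= \cos^2(\phi - \theta) = \cos^2 70^\circ \approx 0.117, \\
I_d &= \tfrac{1}{2}\,(I_0 + I_1) \approx 0.058 .
\end{aligned}
```

Since the demagnetized level is the mean of the other two, the three levels are equally spaced for any choice of analyzer angle; choosing the angle so that $I_0 = 0$ additionally maximizes the contrast.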


Figure 11. Three stable states

Figure 12. The demagnetized state

Each  of  the  gray  scale  methods  we  considered  has  advantages  and  limitations.   Therefore,  we  can  expect  that  a  combination 
of  these  techniques  will  be  the  optimum  method  for  constructing  a  device  with  many  gray  scales.    We  will  consider,  as  an  example, 
one  possible  implementation  that  uses  a  double  cascade  of  four  parallel  mesas,  each  having  three  stable  states.   A  cascade  of  two 
mesas  can  be  conveniently  fabricated  by  growing  garnet  films  on  both  sides  of  the  same  substrate  wafer.   The  mesas  on  each  side 
are  addressed  by  separate  electrode  structures.    Four  mesas  of  unequal  areas  comprise  a  pixel  on  each  side  of  the  substrate.  This 
configuration is shown in Figure 13. The 8 separate mesas can be set at $3^8 = 6561$ possible combinations. It can be shown that
a cascade of two mesas of equal thickness, each of which has three stable states, results in six distinct intensity levels. Since the
pixel of Figure 13 consists of 4 such double cascades, the number of non-degenerate output intensity levels is $6^4 = 1296$. Therefore,
with a relatively simple monolithic structure, a large number of gray levels can be realized with the magneto-optic device.
To  achieve  this  dynamic  range,  however,  the  device  must  be  fabricated  with  corresponding  uniformity  of  response  across  the  array. 
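The count of six levels for a double cascade of three-state mesas can be reproduced with a small model. The Python sketch below makes several assumptions that are not stated in the text: a demagnetized mesa is treated as two half-area domains of opposite rotation, the domains of the two cascaded mesas are taken to overlap independently, the analyzer is ideal, and the rotation (10 degrees) and analyzer angle (70 degrees) are chosen only for illustration.

```python
import math
from itertools import product

STATES = ("zero", "demag", "one")


def domain_mix(state, theta_deg):
    """Return (rotation in degrees, area fraction) pairs for one mesa."""
    if state == "one":
        return [(theta_deg, 1.0)]
    if state == "zero":
        return [(-theta_deg, 1.0)]
    # Demagnetized: half the area magnetized in each direction.
    return [(theta_deg, 0.5), (-theta_deg, 0.5)]


def double_cascade_intensity(front, back, theta_deg, phi_deg):
    """Area-averaged intensity behind the analyzer for two cascaded mesas."""
    total = 0.0
    for r1, w1 in domain_mix(front, theta_deg):
        for r2, w2 in domain_mix(back, theta_deg):
            total += w1 * w2 * math.cos(math.radians(phi_deg - r1 - r2)) ** 2
    return total


def distinct_levels(theta_deg=10.0, phi_deg=70.0):
    """Distinct intensity levels over all nine front/back state pairs."""
    values = {
        round(double_cascade_intensity(f, b, theta_deg, phi_deg), 9)
        for f, b in product(STATES, repeat=2)
    }
    return sorted(values)


levels = distinct_levels()
print(len(levels), "distinct levels:", levels)
print("mesa combinations:", 3 ** 8, " levels for four double cascades:", 6 ** 4)
```

Nine state pairs give six distinct levels because swapping the front and back mesas does not change the output.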


Figure 13. A possible hybrid implementation (combination and side views)
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Conclusion 

We believe that the unique capabilities of the LIGHT-MOD can provide elegant solutions to problems in the optical signal
processing area.

In addition to its potential use in optical processing as a two dimensional spatial light modulator, processor, and
detector, its applications range from large screen projection display systems to helmet mounted or head up displays, hard copy
displays, and hand held direct view displays.

Acknowledgements 

The  authors  are  gratefully  indebted  to  Dr.  B.  MacNeal,  D.  Cox,  W.  Robinson  and  S.  Mills  for  device  design,  system  design, 
electronic  assembly,  and  software,  and  to  Dr.  M.  Shone,  Dr.  K.  Vermuri,  and  Dr.  R.  Belt  (Airtron  Division,  Litton  Industries)  for 
garnet  films,  and  Dr.  G.  Pulliam  for  technical  editing. 

References 

1.  G.  R.  Pulliam,  W.  E.  Ross,  B.  E.  MacNeal  and  R.  F.  Bailey,  "Large  Stable  Magnetic  Domains,"  Paper  GD-1,  presented  at 
the  1981  3M  Conference. 

2. K. Vural, F. Humphrey, "Dynamic Wall Deformation in Bubble Garnet Materials," J. Appl. Phys., Vol. 51, No. 1,
p. 549 (1980).

3. J. P. Krumme, J. Verwell, J. Haberkamp, W. Tolksdorf, G. Barrels, and G. P. Espinoza, "Thermomagnetic Recording in
Thin Garnet Layers," Appl. Phys. Lett., 20 (1), 451-453 (1972).

4.  Bernard  Hill  and  Klaus-Peter  Schmidt,  "Thin-Film  Iron  Garnet  Display  Components,"  SID  1979  International  Symposium. 

5.  D.  Casasent,  "Spatial  Light  Modulators,"  Proc.  IEEE,  65,  143  (1977). 

6.  A.  R.  Tanguay,  "Spatial  Light  Modulators  For  Real  Time  Optical  Processing,"  in  "Future  Directions  for  Optical 
Information  Processing,"  pp.  52,  Final  Report,  Texas  University,  Lubbock,  Texas,  1980. 


198 / SPIE Vol. 341 Real Time Signal Processing V (1982)


Problems  in  two  dimensions 
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Abstract 

This paper describes the problems encountered when one attempts to architect a two
dimensional acousto-optical signal processing system using conventional Bragg cell devices.
Specifically addressed will be the solution to the problem by utilizing degenerate and/or
tangential mode birefringent A.O. devices. This novel configuration will demonstrate a
highly efficient hybrid time/space integrating two-dimensional processor for range/Doppler
ambiguity function calculation.

Introduction 

A bulk acousto-optic (A.O.) device consists of 1.) a crystal bulk, 2.) a piezoelectric
transducer   and    3.)    a    tuning   network.      Electronic    signals    can   be    applied    to    the  tuning 
network  which   passes    components    of    the    signal    to    the    transducer.      The    transducer    is  an 
electro-mechanical    device   which   changes    its    thickness   proportionally    to    the   amplitude  of 
the   signal.      The    transducer    is    bonded   directly    to    the   crystal    bulk.      Consequently,    as  the 
transducer   physically   vibrates,    a    sound   wave    is    launched    into    the    crystal.      This  acoustic 
wave    travels    to    the   opposite   end   of    the    crystal   where    it    is    absorbed.      Along    the   way,  the 
acoustic    field    induces    an   acoustic    strain   on    the   molecules    of    the    crystal.      If    light  is 
passed   through   the    crystal,    in   a   somewhat    perpendicular   direction    to    that    of    the  acoustic 
wave,    then   an    interesting   phenomenon    occurs.      The    light    slows    down   as    a    function   of  the 
degree   of    strain    induced   by    the   acoustic    field.      The   compression   of    the   crystal    due    to  the 
acoustic    field   causes    a   change    in    the   crystal's    index   of    refraction  which    in    turn  causes 
the light to propagate slower through the crystal at the compression point. This delay of
the    light    differs    across    the   crystal    depending   on    the    shape   of    the    traveling   acoustic  wave 
which,    of    course,    is    proportional    to   a    section   of    the    input    signal.      The    retardation   of  the 
light    can   be    thought    of    as    a   phase   delay.      The   A.O.      device    can   now  be   modeled   as  a 
temporal    optical    phase  modulator. 
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Depending   on    the    length  of    the    crystal   bulk,    the   device   holds   a   certain    length  of  the 
input    signal    in   a  moving   window   fashion.      That    is,    the   device   appears    as    a  moving  time 
aperture    impressed   upon   the    input    signal.      Consequently,    the   device    is    considered    to  have 
storage. The device has a storage time, T, and a bandwidth, B, and therefore a
time/bandwidth product, TB. Device TB's range in practice from 1 to 2,000. A typical
device may have a bandwidth of 100 MHz and a time aperture of 10 µsec (TB = 1,000). The
interaction of light with sound is explained in greater detail in Damon's paper.<1>


The   concept   of    time/bandwidth    is    often   utilized   when   describing   a   device.      For  the 
example above, the device can resolve 1,000 samples of the input signal, each 10 ns long. Since all
1,000 samples can be illuminated in parallel by the incident light, the device can be
further modeled as a tapped delay line/mixer. If coherent light is used, such as a laser,
followed by a lens with a spherical surface, at the focus of the lens the intensity Fourier
transform, or spectrum, can be seen.<2> This is because the quadratic surface of the lens
acts as a chirp filter placing quadratic tap weights on the delay line. This simple effect
has been well documented <3,4> and has led to several field deployed optical systems.
One can see the advantage of such systems when compared to an electronic system performing
the same operation. The optical system performs 1,000 multiply/additions (MA) every 10 ns
for an effective throughput rate of 10^11 MA's/sec. Current digital machines approach
10^8 MA's/sec for serial machines. Parallel digital architectures are thus required to
meet the speed realized by the optical processor.
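The throughput figure follows from the time/bandwidth product: every sample period (one over the bandwidth) the device forms one product per resolvable sample. A minimal Python sketch of this arithmetic is given below; the function names are illustrative.

```python
def time_bandwidth_product(bandwidth_hz, aperture_s):
    """Number of resolvable samples held in the cell at one time."""
    return bandwidth_hz * aperture_s


def throughput_ma_per_s(bandwidth_hz, aperture_s):
    """Multiply/additions per second for a single cell used as a tapped delay line."""
    taps = time_bandwidth_product(bandwidth_hz, aperture_s)
    sample_period_s = 1.0 / bandwidth_hz
    return taps / sample_period_s


# Typical device from the text: 100 MHz bandwidth, 10 microsecond aperture.
tb = time_bandwidth_product(100e6, 10e-6)
rate = throughput_ma_per_s(100e6, 10e-6)

print(f"TB = {tb:.0f}")
print(f"throughput = {rate:.1e} MA/s")
```

For the typical device this gives 1,000 taps evaluated every 10 ns, about three orders of magnitude above the serial digital rate quoted in the text.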


Processing advantages accrue when architectures that utilize multiple A.O. devices
are considered. Two A.O. devices, reverse imaged onto each other, form the basis for
one-dimensional correlator and convolver architectures, both space and time integrating.
Several architectures (along with a plethora of equations) are described in references 5 - 12.
The convolution and correlation algorithms perform many signal processing tasks such as
high resolution chirp-Z spectrum analysis; spread spectrum communication modulation,
demodulation and acquisition; direction finding (DF); radar range or Doppler processing;
and many more. Although one-dimensional multicell systems complement many signal
processing algorithms, the throughput improvement over the single device system is only a
factor of two. Problems in one dimension have been previously addressed in reference 8.


The large throughput gain is realized when A.O. devices are configured in a
two-dimensional architecture such as shown in figure 1. Combinatorial operations can be
achieved such as range vs. Doppler displays, and two-dimensional spectrum analysis for
signals or images.<6, 13-18> In two dimensions, two A.O. devices crossed orthogonally, each
with a TB of 1,000, will present at the detector a different 1,000 x 1,000 image or surface
for each pixel delay. Throughput rate in theory can reach 10^15 MA's/sec. (This assumes
that two 1 GHz / 1 µsec devices are used.) This rate is limited by optical source power and
detector performance, e.g., a 1,000 x 1,000 pixel detector array currently does not exist.
The A.O. devices do exist.
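The theoretical two-dimensional rate can be reconstructed as follows, assuming that every point of the crossed 1,000 x 1,000 surface counts as one multiply/addition per sample period and that the sample period is the reciprocal of the 1 GHz bandwidth. The symbols $N_{\mathrm{MA}}$ and $R$ are introduced here for illustration.

```latex
\begin{aligned}
TB &= (1\ \mathrm{GHz})(1\ \mu\mathrm{s}) = 10^{3} \quad \text{per device}, \\
N_{\mathrm{MA}} &= TB_1 \times TB_2 = 10^{3} \times 10^{3} = 10^{6} \quad \text{per sample period}, \\
R &= \frac{N_{\mathrm{MA}}}{1/B} = \frac{10^{6}}{10^{-9}\ \mathrm{s}} = 10^{15}\ \mathrm{MA/s}.
\end{aligned}
```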


2-D Ambiguity Function Processor

Figure 2. Bragg regime 2-D processor; input: x,z plane at Bragg angle; y,z plane coaxial

Figure 3. Bragg regime 2-D processor; input: x,z plane at Bragg angle; y,z plane at Bragg angle
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Bragg   Regime    2-D  Processor 


For    two   dimensional   processor   applications,    operation   of    acousto-optic   devices    in  the 
standard    Bragg    regime   can   cause   extreme   degradation    in   a)    system  bandwidth,    b)  dynamic 
range    (photon    throughput),    c)    phase    linearity   and   d)    system  resolution.      With   the    aid  of 
figures    2    and    3,    an   understanding   of    the   causes   of    degradation   can   be  seen. 
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1.  )    The   deflected    light    does   not    travel    coaxially   down    the    telecentric    imaging  system. 
Most designers may not consider this a problem until the frequency response of the lens is
taken into consideration. For example, in a slow shear TeO2 device (commonly used in 2-D
processors) with 100 MHz bandwidth centered at 150 MHz, the upper cutoff spatial frequency
is 317 lp/mm. For comparison a good Nikon lens may process 200 lp/mm. Consequently the
optical system will require an expensive design to maintain bandwidth, resolution and phase
linearity.

2.) Deflected first order light from the first A.O. device now enters the second
device at the Bragg angle in the wrong dimension. This causes a "momentum mismatch", a
detuning of the acousto-optic bandshape in that dimension, and thus a reduction in system
bandwidth. This was avoided at the first cell by utilizing perpendicular illumination in
the vertical dimension as shown in the side view of figure 2. Unfortunately, first order
deflected light also enters the second A.O. device perpendicular to its face along the
acoustic axis of A.O. cell 2. This is a disastrous condition as the second device will
not deflect any light. The result is no result!
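The spatial frequency burden on the imaging lens described in item 1 can be estimated from the acoustic wavelength in the cell: one acoustic cycle corresponds to one line pair. The Python sketch below assumes a slow shear acoustic velocity of about 617 m/s for TeO2 (a commonly quoted value, not given in the text) and interprets the upper band edge as 200 MHz, that is, the 150 MHz center frequency plus half of a 100 MHz bandwidth. Both are assumptions made here, and the function name is illustrative.

```python
def spatial_frequency_lp_per_mm(frequency_hz, acoustic_velocity_m_per_s):
    """Acoustic cycles per millimeter inside the cell (one cycle = one line pair)."""
    wavelength_mm = acoustic_velocity_m_per_s / frequency_hz * 1e3
    return 1.0 / wavelength_mm


SLOW_SHEAR_TEO2_M_PER_S = 617.0   # assumed velocity, not taken from the text
UPPER_BAND_EDGE_HZ = 200e6        # assumed: 150 MHz center + 50 MHz

upper_cutoff = spatial_frequency_lp_per_mm(UPPER_BAND_EDGE_HZ, SLOW_SHEAR_TEO2_M_PER_S)
print(f"upper cutoff spatial frequency: {upper_cutoff:.0f} lp/mm")
```

With these assumed values the result is a little over 300 lp/mm, in line with the figure quoted in item 1 and well beyond the 200 lp/mm cited for a good commercial lens.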

The    clever   optical   design   engineer   now   examines    the    problem  and    finds    a    solution  as 
shown in figure 3. He allows light to enter the first device not only at the correct Bragg
angle   with   respect    to    the   acoustic    axis    of    the    first    device   but    also   at    the    Bragg  angle 
with   respect    to    the   acoustic    axis    of    the    second   device   as    shown    in   the    side   view   of  figure 
3.      However,    this    solution   only   creates   more  problems. 

1.) A reduction in bandwidth at the first Bragg cell due to momentum mismatch in the
vertical direction.

With the wave plate it is possible to match the polarization at midband. Polarization
matching at band edge is still poor and will result in further bandwidth reduction. It is
possible to redesign the transducer electrode geometry to enable the device to prefer
either right elliptical or left elliptical input light polarization (but not both) in order
to maintain the same bandwidth.

Tangential/Degenerate (TD) Mode Regime
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symetric    about   K(001).      TeO^   devices    commerically   available    (as   of    the   date   of  this 
paper)    are   designed    as   beam   deflectors    and   operate    in    the    Bragg    regime   only.  This 
interaction    is    shown    in    figure   4.      Input    illumination    is    represented    as   K^.      This  input 
as   shown   must    be    right    elliptically   polarized.      The   acoustic    field    is    shown   as   K    .  The 
length   of    this   vector   must    be    sufficient    to    intersect    the  curve,    momentum  matched. 

The   output,    at    positive   Bragg   angle    is    now  K    .      Operation    in    this   mode    leads    to  the 
system  architectures    and   associated   problems    discussed    in    the   previous  section. 

Upon    further   examination   of    figure   4,    other  modes   of    operation   appear    feasible,    due  to 
device symmetry with this crystal cut. These have been defined by Hecht<19> as tangential
and  degenerate. 


Tangential mode is realized by the K^, K_a2, K2 interaction. Input light,
right elliptically polarized, enters as K^ to match on n^. Correct choice of input
frequency, K^, will allow a "tangential" momentum match along n^ precisely at the
K(001) or Z (optical) axis. The term tangential is used because the match is tangential to
the n surface. For this crystal cut, K corresponds to an input frequency of
37.5 MHz. The deflected output beam is now collinear with the optical axis, K(001), and
thus perpendicular to the face of the acousto-optic device as well as being left
elliptically polarized. Notice, however, that K^ is also momentum matched from the
tangential phase match point on n at the K(001) axis, back to the n^ surface.
This will cause a second order deflection,
conjugate angle of K^. This second order
optical power. Consequently this interact
"degenerate". Optical power will degenera

th K and K_ output beams, right
input vector K^. These output beams
ative first order diffraction. Consequently, in the Fourier transform plane both positive
and conjugate negative spectra can be

TD 2-D Ambiguity Processor
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design   flexibility    is    increased    considerably.      Figure  5 
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A.O.      device    1    is   operated    in    the    tangential   mode.      Input    illumination    is  polarized 
right elliptical, therefore matching to the n^ index surface. With K chosen to
match tangentially, the first order deflected rays are left elliptically polarized, and
coaxial through the telecentric system. These enter the second device perpendicular to its
optical face in both dimensions. Notice that perpendicular input to the first device is
maintained in the top view. Consequently, the following advantages surface.
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Results 


A tangential/degenerate mode regime two dimensional optical processor for
time/frequency ambiguity function processing was built and tested. Output data are shown in
figures 6 through 17. Two TeO2 devices were built with the crystal cut per
specification of figure 4. The tuning network was centered at the degenerate matching
point, 37.5 MHz, with 20 MHz bandwidth. The time aperture of each device was
approximately 6 us. TB=180. Output data were taken from an RCA CCD detector array with
geometry 244 x 190 pixels at 60 fields/second.


Examples   of    ambiguity    functions    using   TD    2-D   optical  processor: 
Figure 7: Inputs are the same as in figure 6; however, the modulation on the signal to
the second cell has been increased to 10 MHz. As can be seen in the figure, the
separation between the lines has increased to 20 MHz.
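
The doubling of the line separation relative to the modulation frequency can be illustrated numerically. The following is a minimal sketch, not the optical processor itself: it assumes a sampled 10 us pulse, a real (double-sideband) cosine modulation on the second input, and a hypothetical helper `cross_ambiguity_zero_delay` that evaluates only the zero-delay cut of the cross-ambiguity surface with an FFT. The sample rate and record length are illustrative choices.

```python
import numpy as np

def cross_ambiguity_zero_delay(s1, s2, fs):
    """Zero-delay cut of the cross-ambiguity surface: |FFT of s1 * conj(s2)|."""
    product = s1 * np.conj(s2)
    spectrum = np.fft.fftshift(np.fft.fft(product))
    freqs = np.fft.fftshift(np.fft.fftfreq(len(product), d=1.0 / fs))
    return freqs, np.abs(spectrum)

fs = 100e6                               # assumed sample rate, 100 MHz
t = np.arange(2000) / fs                 # 20 us record
pulse = (t < 10e-6).astype(float)        # 10 us rectangular pulse

f_mod = 10e6                             # modulation applied to the second input
s1 = pulse
s2 = pulse * np.cos(2 * np.pi * f_mod * t)

freqs, mag = cross_ambiguity_zero_delay(s1, s2, fs)
top_two = freqs[np.argsort(mag)[-2:]]    # the two strongest spectral lines
separation = abs(top_two[0] - top_two[1])
print(f"line separation = {separation / 1e6:.1f} MHz")
```

Under these assumptions the two strongest lines fall at plus and minus the modulation frequency, so their separation is twice the modulation frequency.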
Figure 6. 10 us pulse width, 20 us pulse repetition interval (PRI). Output: 1 MHz Doppler shift.
Figure 7. 10 us pulse width, 20 us PRI. Output: 10 MHz Doppler shift.
Figure 8. 3 us pulse width, 10 us PRI. Output: no Doppler shift.
Figure 9. 1 us pulse width, 10 us PRI. Output: no Doppler shift.
Figure 10. 100 ns pulse width, 2.5 us PRI. Output: no Doppler shift.
Figure 11. 100 ns pulse width, 1 us PRI. Output: no Doppler shift (bed of nails).
Figure 12. Double pulse input: 500 ns pulse width, 700 ns pulse spacing, 3.5 us double pulse set rep. rate. Output: 9.0 MHz Doppler shift.
Figure 13. Double pulse input: 500 ns pulse width, 600 ns pulse spacing, 2 us double pulse rep. rate.
Figure 14. 400 ns pulses, 4 us repetition rate, 14 MHz Doppler shift.
Figure 15. Gen. B: 400 ns pulses, 2 us repetition rate, 4 MHz Doppler shift.
Figure 16. Uncorrelated.
Figure 17. Correlated.

Figures 6 - 17: horizontal axis represents frequency correlation, vertical axis
represents time correlation: f(x), t(y) ambiguity function.
Figure 8: Here the input pulse width has been decreased to 3 us to show that along
the time correlation axis, the triangular envelope has decreased to 6 us as expected. In
addition the width of the ambiguity function has increased to +/- 166.7 kHz. No CW
"Doppler" modulation has been placed on either input signal.

Figure 9: The input pulse width is again decreased to 1 us. PRI remains the same as
in the above examples at 10 us to keep adjacent ambiguity functions out of the window of
the 2-D processor. Notice in this figure that the output ambiguity function has again
decreased in length along the time correlation axis to 2 us and increased in the frequency
domain. Careful examination of the figure will reveal the (sin x)/x side lobe structure of
the function. The distance to the first zeros of the function along the frequency axis
is +/- 500 kHz.

Figure 10: This ambiguity function is generated by insertion of 100 ns pulses
separated by 2.5 us. Notice that the figure shows 3 bright correlations and a dimmer one
(just outside the window of the CCD array). Considerable spreading in frequency is
observed on each correlation. The zeros in frequency correspond to +/- 10 MHz. Pulse
separation is 2.5 us. No CW modulation for Doppler was placed on this waveform.

Figure 11: Here the PRI has been reduced to 1 us. Notice that nulls along the
frequency axis on each ambiguity function are starting to appear. The spacing of the nulls
corresponds to 500 kHz, or one over twice the PRI. This sort of function is commonly
referred to as a "bed of nails" ambiguity function.

Figure 12: This ambiguity function is generated by inputting to the optical processor a
repetitive double pulse sequence. Each pulse is 500 ns wide and separated from its
adjacent pulse by 700 ns (leading edge to leading edge). Each double pulse pair is
separated by 3.5 us. This signal is fed to both A.O. devices with a 9 MHz modulation
mixed on the waveform before the second A.O. device. Notice six clusters of correlation
information in the figure. Each cluster represents the ambiguity function of any
particular double pulse sequence. As can be seen along the time domain axis, 3 peaks
within each are present. For double pulse inputs, the expected output should be three
peaks. Also notice that each cluster is separated by 3.5 us in the time domain. Of course
two sets of three exist in frequency, separated by 18 MHz or twice the input "Doppler"
modulation (+/- 9 MHz). Finally, notice the fine fringe structure (running vertically).
The null spacing of these fringes corresponds to one over twice the double pulse set
repetition rate.

Figure 13: This photo was taken from the output of the CCD for an input waveform
consisting of a repetitive double pulse sequence similar to that shown in figure 12. The
difference is that the 500 ns pulses are separated by 600 ns and the double pulse repetition
rate is lowered to 2 us so as to allow the correlation clusters to run together and produce
interference.

Sequence figures 14, 15, 16 & 17: Here two signal generators are used to demonstrate
the ability of the processor to correlate on PRI independent of Doppler. The ambiguity
function produced from generator A is shown in figure 14. The input waveform consists of
400 ns pulses separated by 4 us. CW modulation of the input before the second cell is
14 MHz. The second generator is set at a 2 us PRI (+/- 10%), with 400 ns pulses, and a 4 MHz
modulation (asynchronous to the 14 MHz oscillator) on the waveform to the second cell.
When the two signals are added at each cell, the composite output ambiguity function can be
seen as in figure 16. Notice that the two ambiguity functions of figures 14 & 15
superimpose. Also notice that vertical background smears are apparent. Since the PRI's of
the two generators are not locked, the smears represent correlation peaks that are moving
across the aperture during the CCD integration time. However, when the two PRI's are
locked, i.e. generator B at exactly 2 us, the cross correlation between the two waveforms
"pops" out of the background. This can aid in identification of various waveforms in the
environment when they have been previously unknown.

Summary 


The design of acousto-optic processing systems, especially in two dimensions, should
include optimization of the A.O. device for the architecture. Not only does this include
the input tuning network and electrode/transducer, but most important, the angular cut of
the crystal bulk. If the design engineer is forced to use "off-the-shelf" Bragg cells, his
system may exhibit poor performance, especially in two dimensions where system constraints
today push the photon budget limit (speed). This also applies to the system contractor.
The system should not be specified without study of the various architectural possibilities
from a device viewpoint at the crystal design level. As shown in this paper, device
optimization can produce exceptional results.
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Evaluation  of  spatial  filters  for  background  suppression 
in  infrared  mosaic  sensor  systems 


T.  L.  Bergen,  P.  K.  Mazaika 

Systems  Modeling  and  Analysis  Department,  The  Aerospace  Corporation 
2350  East  El  Segundo  Boulevard,  El  Segundo,  California  90245 

Abstract 


Spaceborne  infrared  mosaic   sensors  have  been  proposed  for  future  surveillance  systems. 
Because  these  systems  will  generate  a  large  volume  of  data,   background  suppression  will 
require  algorithms  which  use  innovative  architectures  and  minimal  storage. 

This  paper  analyzes  the  implementation  and  performance  of  candidate  temporal  and  spatial 
filters.     Spatial  filters  are  attractive  because  they  require  far  less  memory,    can  effectively 
exploit  a  parallel,    pipelined  architecture,    and  are  relatively  insensitive  to  target  speed. 
However,   the  performance  of  spatial  filtering  is   substantially  worse  than  that  of  temporal  filtering 
when  the  sensor  has  good  line-of-sight  stability. 

Introduction 


Rapid  advances  have  been  occurring  in  VLSI  (Very  Large  Scale  Integration)   development  for 
commercial  applications;  DOD's  VHSIC  (Very  High  Speed  Integrated  Circuit)  program  will  spur
comparable  development  of  VLSI  chips  for  military  use.      The  expected  gains   in  circuit  density, 
device  speed  and  the  reduction  of  power  consumption  make  digital  processing  potentially  applicable 
to  the  signal  processing  tasks  associated  with  a  mosaic  focal  plane.     Several  types  of  background 
suppression  algorithms  have  been  proposed  for  the   real  time  detection  of  point  targets  using  mosaic 
sensors.  The  background  suppression  algorithms  must  compress  a  large  volume  of  data  into  a
few  candidate  target  points  without  suppressing  any  real  targets.     Thus,   the  algorithms  must  be
efficient  in  speed  and  storage  requirements,   while  giving  good  target-to-clutter  performance.  This 
paper  compares  the  computational  requirements  and  performance  characteristics  of  two  generic 
types  of  background  suppression,    spatial  and  temporal  filtering. 

Part  I  discusses  possible  architectures  for  both  spatial  and  temporal  digital  filters,  using 
ultra-high-speed  signal  processors  designed  by  Swartzlander  and  Gilbert  (refs.  3,  4).  Special  architectures
can  take  advantage  of  the  high  degree  of  parallelism  inherent  in  signal  processing  algorithms  to
achieve  much  higher  throughput  rates  than  can  be  achieved  in  a  general  purpose  computer.  The
heart  of  these  processors  is  the  inner  product  computer  (ref.  5).

Part  II  compares  the  performance  of  the  Laplacian  and  point  detection  spatial  filters  with  the 
performance  of  first  and  second  temporal  differencing  in  the  presence  of  line-of-sight  motion. 
Clutter  suppression  ability  is  evaluated  by  simulations  of  the  filters  across  background  scenes 
taken  from  Skylab  photographs  of  the  earth,   while  target  response  is  evaluated  analytically 
assuming  a  "worst  case"  crossing  geometry  on  the  focal  plane.      Filter  performance  is  then 
measured  by  the  ratio  of  target  response  to  RMS  clutter.      Since  clutter  suppression  in  a  temporal 
filter  depends  on  the  LOS  drift  rate,  there  is  some  drift  rate  for  which  the  first  temporal
differencing  performance  equals  the  spatial  filter  performance;  this  is  defined  as  the  drift  equivalent
clutter  of  the  spatial  filters.      The  drift  equivalent  clutter  concept  makes   it  easy  to  compare  the 
relative  performance  of  these  spatial  and  temporal  filters. 

For  further  information  on  candidate  background  suppression  filters  and  their  performance, 
the  reader  is  referred  to  the  literature  (refs.  1,  2,  6-10).

Signal    processing  architecture 

Temporal  difference  filters 

Temporal  difference  filters  can  be  written  as   vector  inner  products,    a  form  particularly 
suited  to  implementation  in  digital  hardware.      For  first-order  differencing 

$$y_t = (1, -1) \cdot (x_t, x_{t-1}) = x_t - x_{t-1} \tag{1}$$

where $x_t$ is the intensity on a detector at time t and $y_t$ is the filtered value.
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For  second-order  differencing,    the  corresponding  formula  is 


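$$y_t = \left(-\tfrac{1}{2},\, 1,\, -\tfrac{1}{2}\right) \cdot (x_t, x_{t-1}, x_{t-2}) = -\tfrac{1}{2}(x_t + x_{t-2}) + x_{t-1} \tag{2}$$

while for third-order differencing it is

$$y_t = \left(-\tfrac{1}{3},\, 1,\, -1,\, \tfrac{1}{3}\right) \cdot (x_t, x_{t-1}, x_{t-2}, x_{t-3}) = -\tfrac{1}{3}(x_t - x_{t-3}) + (x_{t-1} - x_{t-2}) \tag{3}$$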

A straightforward parallel hardware implementation of an $n$th-order temporal difference filter would
consist of n+1 parallel multipliers followed by a binary tree of n adders in $K = [\log_2 (n+1)] + 1$
stages ([x] equals the greatest integer in x), as illustrated in Figure 1a for third-order differencing.
Using the symmetry of the coefficients, we can halve the number of multipliers needed, as shown in
Figure 1b.
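
The inner-product form of these filters, and the saving obtained from coefficient symmetry, can be sketched in a few lines. This is an illustrative sketch only: the function names `temporal_difference` and `third_order_folded` are hypothetical, and the sample history values are arbitrary.

```python
import numpy as np

# Coefficient vectors of equations (1)-(3); index 0 multiplies x_t,
# index 1 multiplies x_{t-1}, and so on.
COEFFS = {
    1: np.array([1.0, -1.0]),
    2: np.array([-0.5, 1.0, -0.5]),
    3: np.array([-1.0 / 3.0, 1.0, -1.0, 1.0 / 3.0]),
}

def temporal_difference(history, order):
    """Direct form: one multiply per coefficient, then a sum (inner product)."""
    return float(np.dot(COEFFS[order], history[: order + 1]))

def third_order_folded(history):
    """Symmetric form: pair the samples first, so only two products remain."""
    x_t, x_t1, x_t2, x_t3 = history[:4]
    return (1.0 / 3.0) * (-x_t + x_t3) + 1.0 * (x_t1 - x_t2)

history = [4.0, 7.0, 1.0, 10.0]          # x_t, x_{t-1}, x_{t-2}, x_{t-3}
print(temporal_difference(history, 3), third_order_folded(history))
```

Both forms give the same output; the folded form trades two of the four multiplications for subtractions performed ahead of the multipliers. A constant detector history yields zero for every order, which is the stationary-background rejection the filters are meant to provide.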


Figure 1. Two architectures for third-order temporal differencing:
(a) $y_t = -\tfrac{1}{3} x_t + x_{t-1} - x_{t-2} + \tfrac{1}{3} x_{t-3}$;
(b) $y_t = \tfrac{1}{3}(-x_t + x_{t-3}) + 1 \cdot (x_{t-1} - x_{t-2})$.

Pipelining  this  algorithm  will  increase  the  throughput  rate.      Registers  are  installed  between  the 
multiplier  and  adder  stages  so  that  adders  and  multipliers  can  work  simultaneously  on  different 
data. The interval at which data are transferred between stages equals the delay time of the slowest stage
and is known as the machine cycle. The inner product of the first data presented to the processor
will be completed in three machine cycles, the latency period of the pipeline. Thereafter, a new
inner product will be completed each machine cycle. Thus for a K-stage pipeline with a machine
cycle $\tau$, the latency period is $K\tau$, while the time $T_N$ necessary to form N inner products is

$$T_N = K\tau + N\tau \tag{4}$$

In continuous operations, N becomes much larger than K, so the average computation time per inner
product approaches the machine cycle time, i.e.,

$$T_{avg} = \frac{T_N}{N} = \tau + \frac{K}{N}\,\tau \tag{5}$$
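
As a numerical illustration of the pipeline timing relations, the sketch below evaluates the total and average time for a three-stage pipeline. The function names are hypothetical, and the 4.5 microsecond machine cycle is taken from the sizing example later in this section.

```python
def pipeline_time(n_products, stages, cycle):
    """Total time for n_products inner products: latency plus one cycle each."""
    return stages * cycle + n_products * cycle

def average_time(n_products, stages, cycle):
    """Average time per inner product; tends to the machine cycle for large N."""
    return pipeline_time(n_products, stages, cycle) / n_products

cycle = 4.5e-6     # machine cycle in seconds (value used later in the text)
stages = 3         # K for third-order differencing
for n in (1, 10, 1000, 10**6):
    print(n, average_time(n, stages, cycle))
```

For small N the latency term dominates, while for N much larger than K the average settles at the machine cycle, which is why the throughput of the pipelined processor is set by its slowest stage rather than by the number of stages.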
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To  arrive  at  an  estimate  of  the  processing  capability  required  to  implement  temporal  differencing 
on  a  mosaic  focal  plane  we  make  the  following  assumptions:

1)  The  focal  plane  consists  of  D  detector  cells  arranged  as  chips  of  ($l \times l$)  cells.

2)  The  focal  plane  is  partitioned  into  S  segments  of  M  chips.     Each  segment  has  an 
independent  processor. 

3)  The  focal  plane  is  sampled  once  per  second. 

4)  The  analog  intensities  are  converted  to  an  eight-bit  intensity  word  before  any  differencing 
operations. 

We consider the processing required for one segment. For $n$th-order differencing, the n previous
scans, each containing $8 l^2 M$ bits, must be saved. In the implementation of Figure 1b, the number
of operations (multiplies and adds) which must be performed per scan is

$$(n + [n/2] + 1)\, M l^2 \tag{6}$$

By pipelining operations and using [n/2] + 1 parallel multipliers, we find that a machine cycle of
$1/(M l^2)$ sec will achieve the desired throughput rate. Note that the higher throughput rates required
by higher-order differencing are attained with greater parallelism rather than a faster machine
cycle. Table 1 summarizes the hardware requirements for first and second temporal differencing
for one segment of $2 \times 10^5$ cells. This assumes a cycle time of 4.5 µsec. For 32x32 cell chips,
220 chips could be processed in one segment. The large memory requirement is the principal
disadvantage of temporal filtering.
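
The sizing argument can be checked with a short calculation. The sketch below is an illustration under the stated assumptions (one scan per second, 8-bit words, 220 chips of 32x32 cells); the function name `segment_requirements` and its parameters are hypothetical.

```python
def segment_requirements(order, chips=220, chip_side=32,
                         bits_per_word=8, scans_per_second=1.0):
    """Cells, scan memory and machine cycle for one focal-plane segment."""
    cells = chips * chip_side ** 2                  # M * l^2 detector cells
    memory_bits = order * bits_per_word * cells     # n previous scans are stored
    machine_cycle = 1.0 / (cells * scans_per_second)  # one inner product per cycle
    return cells, memory_bits, machine_cycle

for order in (1, 2):
    cells, memory_bits, cycle = segment_requirements(order)
    print(order, cells, memory_bits / 1e6, cycle * 1e6)
```

With these inputs the segment holds about 2.2 x 10^5 cells, the machine cycle comes out near 4.4 microseconds, and the stored-scan memory is on the order of 1.8 Mbit per order of differencing, which is close to the rounded figures quoted in the text and in Table 1.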

Table 1. Hardware Requirements for First and Second Order Temporal Differencing

One segment of 220 32x32 chips (2.2x10^5 cells) with 8-bit intensity word

Capacity Required             First Order                  Second Order
Memory                        1.6 M bits                   3.2 M bits
Analog-to-Digital Converter   2.2x10^5 words/sec           2.2x10^5 words/sec
Inner-product Processor       0.2 MOPS                     0.7 MOPS
                              (1 operation per inner       (3 operations per inner
                              product)                     product)

Spatial  filters.  Spatial  filtering  can  also  be  implemented  as  an  inner  product.  A  general 
(n x n) spatial filter (for n odd) can be represented as

$$y_{pq} = \sum_{i=-(n-1)/2}^{(n-1)/2} \; \sum_{j=-(n-1)/2}^{(n-1)/2} h_{ij}\, x_{p+i,\,q+j} \tag{7}$$

where the $h_{ij}$ are the filter coefficients and $x_{ij}$ is the output of the (i,j) detector. We are
interested in a 3x3 filter with symmetric coefficients

$$H = \begin{pmatrix} h_1 & h_2 & h_1 \\ h_2 & h_3 & h_2 \\ h_1 & h_2 & h_1 \end{pmatrix} \tag{8}$$
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Then,   for  example, 

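$$\begin{aligned} y_{22} &= h_1 x_{11} + h_2 x_{12} + h_1 x_{13} + h_2 x_{21} + h_3 x_{22} + h_2 x_{23} + h_1 x_{31} + h_2 x_{32} + h_1 x_{33} \\ &= h_1 (x_{11} + x_{13} + x_{31} + x_{33}) + h_2 (x_{12} + x_{21} + x_{23} + x_{32}) + h_3 x_{22} \end{aligned} \tag{9}$$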

A  hardware  implementation  of  this  scheme,   using  three  multipliers  and  eight  adders,    is   shown  in 
Figure  2.      This  processor  is  more  complex  than  that  required  for  temporal  differencing;  however, 
if  it  operates  with  the  same  cycle  time  as  the  temporal  processor,    it  will  handle  the  same  number 
of  chips  per  scan. 
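
The operation count of this scheme can be made concrete with a small sketch. It is only an illustration of the grouping in the symmetric 3x3 filter, not a model of the hardware; the function name `symmetric_3x3_response` and the test window are hypothetical.

```python
import numpy as np

def symmetric_3x3_response(window, h1, h2, h3):
    """3x3 filter output using three multiplies and eight adds."""
    corners = window[0, 0] + window[0, 2] + window[2, 0] + window[2, 2]  # 3 adds
    edges = window[0, 1] + window[1, 0] + window[1, 2] + window[2, 1]    # 3 adds
    return h1 * corners + h2 * edges + h3 * window[1, 1]  # 3 multiplies, 2 adds

def direct_3x3_response(window, h1, h2, h3):
    """Reference form: nine multiplies and eight adds."""
    H = np.array([[h1, h2, h1], [h2, h3, h2], [h1, h2, h1]])
    return float(np.sum(H * window))

window = np.arange(9, dtype=float).reshape(3, 3)
print(symmetric_3x3_response(window, 0.25, -0.5, 1.0),
      direct_3x3_response(window, 0.25, -0.5, 1.0))
```

Grouping the four corner samples and the four edge samples before multiplying gives the eleven operations per inner product used in the hardware estimate, against seventeen for the direct form.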


Figure  2.      Inner-product  processor  for  spatial  filtering

Spatial  filtering  does  not  require  data  from  previous  scans  so  the  memory  required  is  far  less 
than  for  temporal  differencing  with  the  advantage  growing  as  more  chips  can  be  processed  in  one 
segment.     If  the  chips  in  one  segment  are  processed  sequentially,   analog  storage  for  one  chip  and 
digital  storage  for  three  of  its  rows  are  needed.     The  analog-to-digital  converter  must  also  have
registers  for  one  row  of  each  chip.     This  allows  the  analog-to-digital  conversion  of  row  i  +  3  to
proceed  simultaneously  with  the  processing  of  row  i  +  1   (which  uses  stored  rows  i,    i  +  1  and  i  +  2) 
as  shown  in  Figure  3.      Table  2   summarizes  the  hardware  requirements  for  spatial  filtering  under 
the  same  assumptions  that  were  used  in  Table   1.     (The  8000  bits  listed  as  memory  refer  to  an 
alternate  scheme  with  digital  rather  than  analog  storage.) 
[Figure 3 block diagram. Labels: focal plane (M chips, each chip of $l \times l$ cells); analog storage (row i+3); A-to-D converters and output registers; digital storage (rows i, i+1, i+2); inner-product processor.]


Figure  3.     Storage  scheme  for  spatial  filtering  one  segment  of  a  mosaic  focal  plane 

Table 2. Hardware Requirements for 3x3 Spatial Filters with Symmetric Coefficients

One segment of 220 32x32 chips (2.2x10^5 cells) with 8-bit intensity word

                              Capacity Required
Memory                        10^3 words analog/CCD (or 8000 bits)
Analog-to-Digital Converter   2.2x10^5 words/sec
Inner-product Processor       2.5 MOPS (11 operations per inner product)

A possible drawback to spatial processing is that the detector responsivities may vary within a
chip. The severity of the problem depends on the distribution, within the chip, of the cell
responsivities. If the responsivities vary smoothly and slowly over the chip, they will be relatively
constant over a small area and the nine intensity values used in the 3x3 spatial filter will not need
to be corrected for responsivity variation before spatial filtering. Thus, even if there is substantial
responsivity variation across a chip, spatial filtering will be successful as long as neighboring
detectors have nearly the same responsivity. On the other hand, if responsivities vary sharply
from cell to neighboring cell, intensities must be normalized prior to filtering. This will require
storing a responsivity correction for each cell and might obviate the memory advantage of spatial
processing.


Signal  processing  performance 

The  performance  of  a  background  suppression  filter  depends  on  both  its  clutter  suppression 
ability  and  its  target  response.     We  first  evaluate  the  clutter  suppression  characteristics  of  temporal 
and  spatial  filters.     The  target  response  characteristics  of  these  filters  are  then  derived.  The 
results  of  these  analyses  are  combined  to  evaluate  the   relative  performance  of  the  filters. 

Clutter  suppression  characteristics 

To  evaluate  the  clutter  suppression  characteristics  of  temporal  and   spatial  filters,   a  generalized 
filter simulation program, BACFILT, was developed. BACFILT applies the (m x n) filter $\Gamma = [\gamma_{ij}]$
to the image $B = [b_{ij}]$ with an output given by

$$y_{pq} = \sum_{j=q}^{q+n-1} \; \sum_{i=p}^{p+m-1} \gamma_{ij}\, b_{ij} \tag{10}$$
The filter $\Gamma$ can be chosen to simulate spatial filters, temporal filters in the presence of sensor
drift, and various combinations of filters and filters with optics. The program also can simulate
a range of detector footprint sizes. In this analysis BACFILT was used to filter seven Skylab
images, each of which had been digitized as a 512x512 pixel array. The images were chosen to
show a variety of spatial structures; two of the images are shown in Figure 4. Different size
detectors were simulated with footprint sizes of 2x2, 3x3 and 5x5 background pixels. We assumed
perfect sensor optics and a "snapshot" detector with a short integration time relative to the
frame time.
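
BACFILT itself is not listed in the paper. The following is a minimal sketch of the windowed sum it is described as computing, with the filter indices taken relative to the upper left corner (p, q) of the window; the function name `apply_filter` and the small test image are hypothetical.

```python
import numpy as np

def apply_filter(image, gamma):
    """Slide an (m x n) filter over the image and sum the weighted pixels.

    out[p, q] is the sum of gamma * image over the window whose upper left
    corner is pixel (p, q); only fully covered windows are evaluated.
    """
    m, n = gamma.shape
    rows = image.shape[0] - m + 1
    cols = image.shape[1] - n + 1
    out = np.empty((rows, cols))
    for p in range(rows):
        for q in range(cols):
            out[p, q] = np.sum(gamma * image[p:p + m, q:q + n])
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # intensity rises along each row
gamma = np.array([[-1.0, 0.0, 1.0]])               # difference two columns apart
print(apply_filter(image, gamma))
```

For a 512x512 image the explicit double loop is slow but keeps the correspondence with the summation limits obvious; a vectorized correlation routine would be the natural replacement in practice.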


(a)   Florida  (2,3) 


(b)   Canyons  (6,  3) 


Figure  4.     Two  of  the  Skylab  images  used  in  analysis 

Temporal  filters.     First-order  temporal  filtering  subtracts  the  image  at  time  t-1  from  the  image  at 
time  t.      If  the   sensor  line-of-sight  could  be  held  perfectly  stationary,    first  temporal  differencing 
would  remove  all  stationary  background.      Because  the  sensor  drifts,    the  filter  passes  some  clutter 
from  stationary  background.     We  consider  a  sensor  drifting  one  pixel  per  frame,    in  a  direction 
parallel  to  the  rows  of  the  image.      For  perfect  optics  and  a  snapshot  detector,    the  detector  output, 
x,    is  proportional  to  the  sum  of  the  pixel   radiant  intensities,    b,    within  the  footprint.      Thus,    for  a 
detector  footprint  size  of  (k  x  k)   pixels   initially  located  such  that  its  upper  left  hand  corner  is  pixel 
(p,  q)  of  the  image  (see  Figure  5), 


$$x_{pq} = \eta \sum_{j=q}^{q+k-1} \; \sum_{i=p}^{p+k-1} b_{ij} \tag{11}$$

The proportionality factor, $\eta$, depends on telescope aperture, optical transmittance, detector
responsivity, and integration time. For notational simplicity, we set $\eta = 1$.
[Figure 5 diagram. Labels: initial location; one pixel displacement.]



Figure  5.     Simulated  detector  footprint  of  size  3x3  pixels  before  and  after  one  pixel  displacement 
When  the  same  footprint  is  displaced  by  one  pixel,   the  detector  output  will  be 


$$x_{p,q+1} = \sum_{j=q+1}^{q+k} \; \sum_{i=p}^{p+k-1} b_{ij} \tag{12}$$

The corresponding change, $\Delta x$, in detector output is then given by

$$\Delta x_{pq} = x_{p,q+1} - x_{pq} = \sum_{i=p}^{p+k-1} \left( b_{i,q+k} - b_{iq} \right) \tag{13}$$

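In terms of BACFILT, this is equivalent to applying the k by k+1 filter

$$\Gamma = \begin{pmatrix} -1 & 0 & \cdots & 0 & 1 \\ \vdots & \vdots & & \vdots & \vdots \\ -1 & 0 & \cdots & 0 & 1 \end{pmatrix} \tag{14}$$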

The processor output $y_t$ at time t of first temporal differencing is exactly $\Delta x$ when the drift
rate is one pixel/sample. (The dependence of $y_t$ on drift rate will be discussed later.) Thus,
the statistics of $\Delta x$ over the image determine the statistics of the filtered outputs. In particular,
the RMS drift clutter is proportional to $\sqrt{\langle \Delta x^2 \rangle}$, where $\langle \Delta x^2 \rangle$ denotes the ensemble average over
the image. For a given image, the statistics of $\Delta x$ are a function of footprint size. Table 3
lists the normalized drift clutter, $\sqrt{\langle \Delta x^2 \rangle}/\sigma_B$, of the images for three different footprints, where
$\sigma_B$ is the standard deviation of the background pixel intensities for the unfiltered image.
Table 3. Normalized drift clutter with a drift rate of one background pixel per frame

                   First-order differencing      Second-order differencing
Image              2x2      3x3      5x5         5x5
Mt. Vesuvius       .40      .74      1.55        .37
Clouds (3,1)       .62      1.15     2.8         .59
Clouds (1,3)       .33      .60      1.25        .32
Florida (2,3)      .87      1.58     3.35        .84
Florida (4,2)      1.08     2.08     4.62        1.02
Canyons (3,4)      .86      1.68     3.65        .81
Canyons (6,3)      .69      1.37     3.1         .67


Other  orders  of  differencing  are  similarly  derived.  Second  differencing  is  one-half  the  difference 
between  successive  first  differences  and  is  defined  by 


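$$\begin{aligned} \Delta^2 x_{pq} &= \tfrac{1}{2} \left( \Delta x_{p,q+1} - \Delta x_{pq} \right) \\ &= \tfrac{1}{2} \left( x_{p,q+2} - 2 x_{p,q+1} + x_{pq} \right) \\ &= \tfrac{1}{2} \sum_{i=p}^{p+k-1} \left( b_{i,q+k+1} - b_{i,q+k} - b_{i,q+1} + b_{iq} \right) \end{aligned} \tag{15}$$

This corresponds to the k by k+2 filter

$$\Gamma = \begin{pmatrix} 1/2 & -1/2 & 0 & \cdots & 0 & -1/2 & 1/2 \\ \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\ 1/2 & -1/2 & 0 & \cdots & 0 & -1/2 & 1/2 \end{pmatrix} \tag{16}$$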


The  noise  level  in  the  original  Skylab  photographs  dominates  the  clutter  calculations  for 
second-order  differencing  for  the  2x2  and  3x3  footprints.     The  calculations  for  the   5x5  footprints 
are  listed  in  the  last  column  of  Table  3. 
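
The drift clutter calculation can be sketched as follows. This is an illustration only: the Skylab images are not available here, so a seeded random image stands in for the background, and the function names `footprint_sums` and `normalized_drift_clutter` are hypothetical. The drift is taken as one pixel per frame along the rows, as in the text.

```python
import numpy as np

def footprint_sums(image, k):
    """Detector output for every (k x k) footprint position, with unit gain."""
    rows = image.shape[0] - k + 1
    cols = image.shape[1] - k + 1
    x = np.zeros((rows, cols))
    for di in range(k):
        for dj in range(k):
            x += image[di:di + rows, dj:dj + cols]
    return x

def normalized_drift_clutter(image, k, order=1):
    """RMS of the first or second temporal difference, divided by the pixel sigma."""
    x = footprint_sums(image, k)
    if order == 1:
        d = x[:, 1:] - x[:, :-1]                            # one-pixel displacement
    elif order == 2:
        d = 0.5 * (x[:, 2:] - 2.0 * x[:, 1:-1] + x[:, :-2])  # half the second difference
    else:
        raise ValueError("only first and second differencing are sketched")
    return float(np.sqrt(np.mean(d ** 2)) / np.std(image))

rng = np.random.default_rng(0)
noise_image = rng.normal(size=(200, 200))    # stand-in for a background scene
print(normalized_drift_clutter(noise_image, 3, order=1))
print(normalized_drift_clutter(noise_image, 3, order=2))
```

For spatially uncorrelated pixels the first difference of a k x k footprint sums 2k independent terms, so the normalized clutter is about sqrt(2k); real scenes are spatially correlated, so this random-image value is not a prediction of the tabulated Skylab results.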

Spatial  filters.      Spatial  filters  use  data  from  only  one  scan.     Assuming  a  short  integration  time, 
sensor  drift  can  be  neglected  in  evaluating  the  performance  of  a  spatial  filter.     The  two  3x3  spatial 
filters  most  relevant  to  point  target  detection  are  the  point  detection  and  Laplacian  filters  illustrated 
in  Figure  6. 


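Point Detection Filter:

$$\begin{pmatrix} -1/8 & -1/8 & -1/8 \\ -1/8 & 1 & -1/8 \\ -1/8 & -1/8 & -1/8 \end{pmatrix}$$

Laplacian Filter:

$$\begin{pmatrix} 1/4 & -1/2 & 1/4 \\ -1/2 & 1 & -1/2 \\ 1/4 & -1/2 & 1/4 \end{pmatrix}$$

Figure 6. Spatial filters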
To simulate the performance of a 3x3 spatial filter operating on detectors with (k x k) footprints,
we let the filter

$$\Gamma = \begin{pmatrix} F_1 & F_2 & F_1 \\ F_2 & F_3 & F_2 \\ F_1 & F_2 & F_1 \end{pmatrix} \tag{17}$$

where $\Gamma$ is a (3k x 3k) array with $F_1$, $F_2$ and $F_3$ being (k x k) submatrices of constant value. For
instance, the Laplacian filter with footprint size 2x2 would be represented as

$$\Gamma = \begin{pmatrix} 1/4 & 1/4 & -1/2 & -1/2 & 1/4 & 1/4 \\ 1/4 & 1/4 & -1/2 & -1/2 & 1/4 & 1/4 \\ -1/2 & -1/2 & 1 & 1 & -1/2 & -1/2 \\ -1/2 & -1/2 & 1 & 1 & -1/2 & -1/2 \\ 1/4 & 1/4 & -1/2 & -1/2 & 1/4 & 1/4 \\ 1/4 & 1/4 & -1/2 & -1/2 & 1/4 & 1/4 \end{pmatrix} \tag{18}$$

in BACFILT.
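
The block structure of the expanded filter is a Kronecker product of the 3x3 coefficient matrix with a (k x k) block of ones. The sketch below shows this construction; the function name `expand_for_footprint` is hypothetical, and the coefficient signs are those for which each filter sums to zero, as expected of a background suppression filter.

```python
import numpy as np

LAPLACIAN = np.array([[0.25, -0.5, 0.25],
                      [-0.5, 1.0, -0.5],
                      [0.25, -0.5, 0.25]])

POINT_DETECTION = np.full((3, 3), -1.0 / 8.0)
POINT_DETECTION[1, 1] = 1.0

def expand_for_footprint(filter_3x3, k):
    """Repeat each coefficient over a (k x k) block, giving a (3k x 3k) array."""
    return np.kron(filter_3x3, np.ones((k, k)))

gamma = expand_for_footprint(LAPLACIAN, 2)
print(gamma)
```

Each detector-level coefficient is simply repeated over the background pixels inside that detector's footprint, so the expanded filter applied to the pixel image equals the 3x3 filter applied to the footprint sums.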


The background clutter passed by the spatial filter is the RMS of the filtered image, $\sqrt{\langle y^2 \rangle}$,
where $y_{pq}$ is as defined in equation (10). Table 4 lists the normalized clutter, $\sqrt{\langle y^2 \rangle}/\sigma_B$, passed
by the point filter and Laplacian filter for the seven Skylab images with three
different footprint sizes.

Table 4.     Normalized clutter for spatial filters

                        Point Filter              Laplacian Filter
Image \ Footprint     2x2     3x3     5x5        2x2     3x3     5x5

Mt. Vesuvius          .44    1.24    4.30        .34     .93    3.54
Clouds (3,1)          .75    2.31    7.90        .47    1.48    6.88
Clouds (1,3)          .40    1.16    4.47        .32     .77    3.03
Florida (2,3)         .93    2.70    9.78        .79    1.87    7.75
Florida (4,2)        1.10    3.63   13.95        .65    2.07   10.79
Canyons (3,4)         .95    3.09   11.65        .58    1.76    8.54
Canyons (6,3)         .70    2.35    9.59        .46    1.29    6.74
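A rough sketch of how a normalized clutter value of this kind might be computed is given below. It is an illustration only: the helper names are hypothetical, the normalizing constant `sigma` is passed in by the caller because its definition lies outside this section, and the test images are synthetic rather than Skylab data.

```python
import math
import random


def filter_valid(image, filt):
    """Correlate image with filt, keeping only fully overlapped positions."""
    fr, fc = len(filt), len(filt[0])
    rows, cols = len(image) - fr + 1, len(image[0]) - fc + 1
    return [[sum(filt[i][j] * image[r + i][c + j]
                 for i in range(fr) for j in range(fc))
             for c in range(cols)]
            for r in range(rows)]


def normalized_clutter(image, filt, sigma):
    """RMS of the filtered image divided by a caller-supplied sigma."""
    values = [v for row in filter_valid(image, filt) for v in row]
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return rms / sigma


POINT = [[-1 / 8, -1 / 8, -1 / 8],
         [-1 / 8, 1.0, -1 / 8],
         [-1 / 8, -1 / 8, -1 / 8]]

# Synthetic backgrounds: a linear ramp and unit-variance white noise.
ramp = [[float(r + c) for c in range(16)] for r in range(16)]
random.seed(0)
noise = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(64)]

print(normalized_clutter(ramp, POINT, 1.0))
print(normalized_clutter(noise, POINT, 1.0))
```

The ramp is suppressed completely because the filter coefficients sum to zero and are symmetric, while uncorrelated noise passes through at roughly its original level. For a k x k footprint the 3x3 coefficients would first be expanded into a 3k x 3k array.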

Target response characteristics.      We now turn to evaluation of the target response of the candidate
filters.     To compare the two types of filters, we will use a baseline example that includes "matched"
optics, a poor crossing geometry, and a range of target velocities (the fastest being four times the
slowest).     Target  response  depends  on  the  relation  of  target  velocity  to  sampling  rate  for  both 
spatial  and  temporal  filters.     Our  figure  of  comparison,   the  target  gain  factor,    reflects  overall 
filter  performance  against  the  range  of  target  velocities  an  operational  system  might  encounter. 


"Matched"  optics  refers  to  the  situation  where  the  blur  circle  size  is   roughly  the  same  size  as 
the detector.     We model the image blur as a two-dimensional Gaussian distribution with σ = 30% of
the  detector  width.     When  the  target  image  is  centered  on  a  detector,   then  80%  of  the  target  image 
irradiance  will  be  incident  on  the  detector.     In  the  "matched"  optics  case,   the  target  dwell  time  has 
very  little  dependence  on  the  crossing  geometry  of  the  target  moving  across  the  focal   plane  --  thus, 
if  only  the  target  velocity  is  known,   the  dwell  time  can  still  be  estimated  fairly  accurately. 

In  the  chosen  crossing  geometry  the  target  image  moves  along  a  "seam"  of  the  detector  array 
with  50%  of  the  irradiance  falling  on  either  side  of  the  seam.     (See  Figure  7.)     If  the  signal-to- 
clutter  ratio  is  adequate  for  this  case,   then  detection  is  assured  because  more  signal  comes  through 
in other crossing geometries.     Note we have assumed a focal plane "fill factor" of 1, i.e., no gaps
or  dead  space  between  detectors. 
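The irradiance fractions quoted for the "matched" optics case can be checked with a short calculation. The sketch below assumes a separable Gaussian blur with a standard deviation of 0.30 detector widths and a unit-square detector with fill factor 1; the function names are illustrative only.

```python
import math

SIGMA = 0.30  # blur standard deviation, in detector widths


def fraction_1d(lo, hi, center, sigma=SIGMA):
    """Fraction of a 1-D Gaussian centred at `center` that falls in [lo, hi]."""
    s = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf((hi - center) / s) - math.erf((lo - center) / s))


def fraction_on_detector(cx, cy):
    """Fraction of the blur spot landing on the detector [0, 1] x [0, 1]."""
    return fraction_1d(0.0, 1.0, cx) * fraction_1d(0.0, 1.0, cy)


centered = fraction_on_detector(0.5, 0.5)    # spot centred on the detector
on_seam = fraction_on_detector(0.5, 0.0)     # spot on a seam, mid-detector
on_corner = fraction_on_detector(0.0, 0.0)   # spot at a four-detector corner
print(f"{centered:.2f} {on_seam:.2f} {on_corner:.2f}")
```

Under these assumptions the centred spot puts roughly 80% of its irradiance on one detector, a spot on a seam puts a little under half on each neighbor, and a spot at a corner puts about one quarter on each of four detectors.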
Figure 8.      Distribution of collected energy for different target velocities


Target  dwell  time  depends  on  detector  footprint  size  and  target  velocity.      (With  matched  optics, 
the  effect  of  crossing  geometry  on  dwell  time  can  be  neglected.)     We  consider  three  target 
velocities,    v,    2v  and  v/2;  we  sample  twice  per  dwell  for  the  target  of  velocity  v.     For  the  fast 
targets,   this  gives  one  sample  per  dwell,   while  for  slow  targets,   this  gives  four  samples  per 
dwell.     For  the  medium  speed  targets,   this  sampling  rate  maximizes  temporal  filtering  response 
for  worst  case  target  phasing. 

For integrating CCD detectors, target response is measured in energy.     Let T be the sampling
interval, and consider a target moving the length of the arrow shown in Figure 8a during a time T.
This is an example of a slow target that moves 0.25 detector widths/sample.     The corresponding
distribution of collected energy in time T is also shown in the figure.      During this interval, a
0.43 fraction of the incident energy is collected by a single detector because of the assumed
optical blur and crossing geometry.     Two samples later the target image will be centered at the
corner of four detectors, and the collected energy on any one detector will only be 0.25.  We
will use the larger 0.43 value for computing the spatial filter target response because it yields
the maximum response as the target crosses the detector.     Sampling phase, while not important
for this slow target, is significant for faster targets; we will measure target response for both
spatial and temporal filters as the maximum response for the worst case phasing.     Figures 8b
and 8c show target image displacement (with the worst sample phasing) for the medium and fast
targets, and the corresponding distributions of collected energy averaged over the sampling interval.

Table  5  shows  the  outputs  of  the  spatial  filters  after  convolution  with  the  collected  energy 
distribution.     The  point  detection  filter  has  twice  the  target  response  of  the  Laplacian  filter,  and 
both  filters  are  helped  by  having  more  samples  per  dwell. 

Table 5.      Relative target response of background suppression filters

                                  Target Speed              Target
Filter                        Slow    Medium    Fast      Gain Factor

Spatial
   Point Detection             .36      .27      .15         .27
   Laplacian                   .20      .14      .06         .14

Temporal
   1st Differencing            .14      .21      .17         .14
   2nd Differencing            .06      .14      .12         .06
The results of Table 5 for temporal differencing filters were derived from Figure 9.  This
figure  shows  the  maximum  filtered  response,   for  best  and  worst  sampling  phasing,   as  a  function 
of  target  velocity.     The  filtered  response  is  measured  as  a  percentage  of  the  target  energy  which 
would  be  collected  in  one  sampling  interval  if  the  target  were  stationary  and  located  at  the 
position  of  peak  incident  power  on  the  detector.     The  peak  incident  power  occurs  for  the  target 
image  location  shown  in  Figure  10  when  45%  of  the  power  falls  on  each  of  the  two  adjacent 
detectors.     Thus,   the  results  in  Table  5  are  obtained  by  multiplying  the  worst  case  Figure  9 
values  by  0.45. 


Figure 9.     Filter outputs from temporal differencing versus target velocity (detector widths/sec),
assuming 1 sample/second

Figure 10.     Peak target power distribution (stationary target)

The  table  shows  that  first  temporal  differencing  has  a  50%  better  target  response  than  second 
temporal  differencing,   and  that  second  differencing  performance  rapidly  decays  when  there  are  many 
samples  per  target  dwell  time.     Both  temporal  filters  have  a  loss  in  performance  for  faster  targets 
because  of  the  loss  in  collected  energy  from  the  decreased  dwell  time. 

The  last  column  of  Table  5  shows  the  target  gain  factors  we  will  use  for  the  different  filters. 
The  gain  factor  is  defined  as  the  worst  target  response  over  the  range  of  target  speeds  for  a  given 
sampling  rate.     Recall  that  this  table  was  set  up  assuming  a  sampling  interval  of  one-half  the  dwell 
of  a  medium  speed  target.     This  sampling  rate  is  nearly  optimal  for  the  temporal  filters,    so  the 
target  gain  for  temporal  filters  is  set  equal  to  the  poorest  target  response  at  this  sampling  rate. 
However,   the  table  shows  that  spatial  filter  target  response  improves  when  target  dwell  times  are 
long  compared  to  the  sampling  interval.     Consequently,   faster     sampling  improves  target  response. 
The  limiting  factors  on  faster  sampling  rates  are  electronic  noise  and  bulk  processing  speed,  rather 
than  target  response  itself.      For  the  purpose  of  this  study,   we  chose  a  sampling  interval  for  the 
spatial filters of one-half the dwell of the fast target, twice the sampling rate used for the temporal
filters.     However, a more careful selection of sampling rate should be made for specific applications.
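The selection of the gain factors in the last column of Table 5 can be sketched as follows. The reading of the spatial-filter case is an interpretation, not something stated explicitly: it assumes that doubling the sampling rate makes each target class respond like the next slower column of the table, so that the worst case is the tabulated medium-speed value.

```python
# Relative target responses from Table 5: (slow, medium, fast).
TABLE_5 = {
    "point detection": (0.36, 0.27, 0.15),
    "laplacian": (0.20, 0.14, 0.06),
    "1st differencing": (0.14, 0.21, 0.17),
    "2nd differencing": (0.06, 0.14, 0.12),
}
SPATIAL = {"point detection", "laplacian"}


def gain_factor(name):
    """Worst-case target response over the range of target speeds."""
    slow, medium, fast = TABLE_5[name]
    if name in SPATIAL:
        # Assumed interpretation: sampled twice as fast, the fast target
        # responds like the tabulated medium column and the medium target
        # like the slow column, so the fast column no longer applies.
        return min(slow, medium)
    # Temporal filters keep the tabulated sampling rate.
    return min(slow, medium, fast)


for name in TABLE_5:
    print(f"{name:18s} {gain_factor(name):.2f}")
```

With this reading the sketch reproduces the tabulated gain factors of .27, .14, .14 and .06.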

Drift  equivalent  clutter.       The  performance  of  a  background  suppression  filter  is  measured  by  its 
signal-to-clutter  ratio  (SCR).     For  a  given  filter,   this  is  defined  as 


$$
\text{SCR} = \frac{\text{Target Gain}}{\text{Clutter Leakage}} \tag{19}
$$


where  the  target  gain  is  shown  in  Table  5.     For  spatial  filters,   the  clutter  leakage  is  given  in 
Table  4  and  does  not  depend  on  the  line-of-sight  drift  rate  for  a  snapshot  detector. 

The clutter leakage of the temporal differencing filters does depend on the line-of-sight drift
rate of the sensor: the larger the drift rates, the greater the clutter leakage and the poorer the
SCR.     For first temporal differencing and small drift rates, the clutter leakage is proportional
to drift [11, 12]:

$$
\text{Clutter Leakage} = \frac{\beta}{\beta_k} \times \text{Normalized drift clutter} \tag{20}
$$

where the normalized drift clutter is from Table 3; the drift rate, $\beta$, is the fraction of footprint
width that the line-of-sight drifts per sampling interval; and $\beta_k$ is the drift rate of one pixel/sample
assumed in the derivation of the normalized drift clutter.  (For example, $\beta_k = 1/5$ for the 5x5
footprint.)  The ratio of the normalized drift clutter to $\beta_k$ measures the rate at which clutter leakage
increases per drift distance.  In order to use the simulation results for the 2x2 footprint, we assume
the drift clutter is linearly dependent on $\beta$ for $\beta < 0.5$.

For a given spatial filter, H, there will be some drift rate, $\beta_H$, at which the SCR of the first
temporal differencing filter equals the SCR of the spatial filter:

$$
\frac{\text{Target Gain}_{(H)}}{\text{Clutter}_{(H)}}
= \frac{\text{Target Gain}_{(\text{1st temporal})}}{\beta_H \left( \text{Clutter}_{(\text{1st temporal})} / \beta_k \right)} \tag{21}
$$

We define $\beta_H$ to be the drift equivalent clutter of the spatial filter H.     Table 6 lists the drift equivalent
clutter of the Laplacian and point detection spatial filters for the different images and footprint sizes.
In all cases, the spatial filters have drift equivalent clutter values, $\beta_H$, of roughly 0.3 to 0.5.  Thus,
for drift rates of less than one-third of a footprint per sample, the performance of first temporal
difference is superior to the spatial filters.     The point detection filter has higher SCR values than
the Laplacian:     even though the Laplacian suppresses clutter a little better, it has only half the
target gain of the point detection filter.
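As a worked illustration of equation (21), the sketch below solves for the drift equivalent clutter of the 5x5 footprint for the Mt. Vesuvius image. Because Table 3 is not reproduced in this section, the ratio of normalized drift clutter to the one-pixel drift rate is backed out from a first-differencing SCR entry of Table 7 using equations (19) and (20); that workaround and the function names are assumptions made only for this example.

```python
# Target gain factors from Table 5.
GAIN = {"point": 0.27, "laplacian": 0.14, "first_temporal": 0.14}


def drift_slope(scr_first, beta):
    """Back out (normalized drift clutter / beta_k) from a Table 7 entry.

    Equations (19) and (20) give SCR = gain / (beta * slope).
    """
    return GAIN["first_temporal"] / (scr_first * beta)


def drift_equivalent_clutter(filter_name, spatial_clutter, slope):
    """Solve equation (21) for the drift rate beta_H."""
    scr_spatial = GAIN[filter_name] / spatial_clutter
    return GAIN["first_temporal"] / (scr_spatial * slope)


# Mt. Vesuvius, 5x5 footprint: SCR of 1.81 at beta = .01 (Table 7),
# spatial clutter of 4.30 (point) and 3.54 (Laplacian) from Table 4.
slope = drift_slope(scr_first=1.81, beta=0.01)
print(round(drift_equivalent_clutter("point", 4.30, slope), 2))
print(round(drift_equivalent_clutter("laplacian", 3.54, slope), 2))
```

The two results, about .29 and .46, agree with the Mt. Vesuvius 5x5 entries of Table 6.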

Table 6.     Drift equivalent clutter for spatial filters

                        Point Filter              Laplacian Filter
Image \ Footprint     2x2     3x3     5x5        2x2     3x3     5x5

Mt. Vesuvius          .29     .30     .29        .43     .42     .46
Clouds (3,1)          .31     .35     .39        .39     .42     .57
Clouds (1,3)          .32     .34     .37        .48     .43     .48
Florida (2,3)         .29     .30     .31        .45     .39     .47
Florida (4,2)         .27     .29     .32        .31     .33     .47
Canyons (3,4)         .29     .32     .34        .34     .36     .47
Canyons (6,3)         .26     .29     .32        .34     .31     .42
The SCR for second temporal differencing also depends on drift rate, but for small values
of $\beta$ the clutter leakage is now proportional to $\beta^2$ instead of $\beta$ as in first temporal differencing.
Thus, the relative performance of the first and second temporal differencing filters depends
on the drift rate.      The SCR for second temporal differencing is given by

$$
\text{SCR} = \frac{\text{Target Gain}_{(\text{2nd temporal})}}{\beta^2 \, \text{Normalized drift clutter} / \beta_k^2} \tag{22}
$$

where the target gain is from Table 5, the normalized drift clutter is from Table 3, and $\beta_k = 0.2$
for the footprint size of 5x5 image pixels.      The SCR for first temporal differencing is given by
equations (19) and (20).     Table 7 compares the SCR's for some representative drift rates of $\beta$ = .01 to
$\beta$ = .30.     At $\beta$ = 0.3, the first and second temporal differencing schemes have roughly the same
performance; however, the second differencing algorithm becomes progressively better for smaller
drift rates.      Note that the results in the table assume that line of sight drift is the only source
of clutter; in practice, for very small drift rates other factors such as sensor noise and temporal
background variations would be the dominant contributing factors to clutter.

Table 7.     Signal-to-clutter ratios for first and second temporal differencing filters for various
drift rates, β.

                    β = .01        β = .05        β = .1         β = .2         β = .3
Image              1st    2nd     1st    2nd     1st    2nd     1st    2nd     1st    2nd

Mt. Vesuvius      1.81    65.     .36   2.59     .18    .65     .09    .16     .06    .07
Clouds (3,1)      1.17    42.     .23   1.67     .12    .42     .06    .10     .04    .05
Clouds (1,3)      2.24    75.     .45   3.00     .22    .75     .11    .19     .07    .08
Florida (2,3)      .85    28.     .17   1.13     .08    .28     .04    .07     .03    .03
Florida (4,2)      .61    23.     .12    .94     .06    .23     .03    .06     .02    .03
Canyons (3,4)      .78    30.     .16   1.20     .08    .30     .04    .08     .03    .03
Canyons (6,3)      .90    37.     .18   1.48     .09    .37     .05    .09     .03    .04
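The drift-rate scaling implied by equations (19), (20) and (22) can be illustrated with the tabulated values. The sketch below takes the SCRs at the smallest tabulated drift rate as reference points and scales them inversely with the drift rate (first differencing) or with its square (second differencing); the function names are illustrative, and drift is assumed to be the only clutter source, as in the table.

```python
def scr_first(scr_ref, beta, beta_ref=0.01):
    """First differencing: clutter leakage grows linearly with drift rate."""
    return scr_ref * beta_ref / beta


def scr_second(scr_ref, beta, beta_ref=0.01):
    """Second differencing: clutter leakage grows with drift rate squared."""
    return scr_ref * (beta_ref / beta) ** 2


def crossover_beta(scr1_ref, scr2_ref, beta_ref=0.01):
    """Drift rate at which the two schemes give the same SCR."""
    return beta_ref * scr2_ref / scr1_ref


# Mt. Vesuvius reference values at a drift rate of .01 (Table 7).
for beta in (0.05, 0.1, 0.2, 0.3):
    print(f"{beta:4.2f}  {scr_first(1.81, beta):5.2f}  {scr_second(65.0, beta):5.2f}")
print(f"crossover near {crossover_beta(1.81, 65.0):.2f}")
```

The scaled values track the Mt. Vesuvius row of Table 7 to within rounding, and the two schemes cross at a drift rate somewhat above 0.3, consistent with the observation that they perform about equally there.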

Conclusion 

Detection  of  dim  targets  against  strongly  structured  background  requires  very  good  background 
suppression.      This  can  only  be  achieved  by  using  temporal  filtering  with  good  sensor  line-of-sight 
stability.     Against  lightly  or  moderately  structured  backgrounds,   the  high  performance  achievable 
with  temporal  filtering  is  not  crucial  to  detection  performance.     Spatial  filtering  in  these  cases 
is  an  attractive  alternative  which   requires  minimal  memory  and  an  operation  rate  easily  attainable 
with  parallel,   pipelined  inner  product  computers. 
Abstract

A potential source of response nonuniformities of imaging CID's is attributed to the
transient behavior of the distributed resistance capacitance (DRC) network associated with
the row and column electrodes.     The DRC network is evidenced and characterized by the
measurement of frequency-dependent admittance.     Simulated response nonuniformities compare
favorably to experimentally observed patterns of an InSb CID array.     The characteristics and
their imposed limitations on the array size, pixel density, and readout rate are discussed.

I.  Introduction 

A solid state imaging array often consists of two parts, the detectors and the charge-
transfer device (CTD).    The CTD [1], which may be a charge-coupled device (CCD) [2] or a charge-
injection device (CID) [3], serves the purpose of processing the image signals from the detectors
to the output.     An ideal CTD should be able to process the signals without distortion.  However,
an early design InSb CID array [4,5,6] shows a nonuniformity pattern at the TV monitor.
It is shown in this paper that the observed pattern nonuniformity is dominated by the
distributed resistance capacitance (DRC) effects of the row and column electrodes.     This effect
is a necessary design consideration for larger arrays and/or fast readout rates.

In the following section, the device structure is briefly described.     The existence of
row and column DRC networks is derived from the structure.     The DRC network is then
evidenced and characterized by AC admittance measurements.     In section III, device operation
is explained.     Transient effects of the row and column DRC networks are analyzed.  The
simulation approach for the nonuniformity pattern is established.     In section IV, the
nonuniformity pattern of a 12 x 12 array is simulated.     Characteristic features of the
simulation are described and are correlated to the experimental data obtained from the 32 x 32
InSb CID array.     In the last section, the limitations on the sampling readout rate and array
pixel size due to the performance nonuniformity are discussed.

II.     CID  array  structure 

The monolithic infrared (IR) CID array used in this paper has been described in previous
publications [4,5].     The first image was obtained in May 1980.     This device has two levels
of metal and is fabricated by five photomasking steps.     The first metal level on the thin
gate oxide forms the columns.     The second metal level on the thick gate oxide forms the
rows.     The array has thirty-two rows by thirty-two columns with the metalization mask layout
as shown in Figure 1.     The intercept of each row with each column forms a unit cell.  The
cell-to-cell spacing is 50 μm in both x- and y-directions.     For the device reported here,
the column electrode was provided by a relatively thick and opaque chromium stripe with a
20 μm x 35 μm opening in the center of each unit cell.     Sheet resistance versus chromium film
thickness was provided in a previous publication.     The row electrode was fabricated by the
evaporation of a very thin semi-transparent chromium film, followed by an evaporated thick
aluminum layer.

From  the  above  description  and  as  shown  in  Figure  2,  each  row  or  column  can  be  regarded 
as  a  32-element  distributed  R-C  network.     Each  unit-cell  resistor  is  shunted  by  a  unit-cell 
capacitance.     The  unit-cell  capacitance  is  the  sum  of  the  metal-insulator-semiconductor 
(MIS)    capacitance  of  the  electrode  and  the  coupling  capacitance  between  the  row  and  column 
electrodes.     The electrode resistance R and capacitance C are defined as the sum of the 32
resistances and the sum of the 32 capacitances, respectively.     The DRC model is used for
the a.c. [7,8] and transient analysis of the row and column electrodes.

The parallel equivalent admittances Cp and Gp/ω were measured by an HP-4257A LCR meter
between a single electrode and the substrate with the remaining 63 electrodes shorted
to the substrate.     The measured admittances versus frequency for a single electrode follow
the same curve shape as expected from the DRC model.     The electrode capacitances, obtained
in the low frequency range, are 19 pF for the column and 12 pF for the row.
Figure 1.   Photomicrograph of the CID array.     Driver connections to the rows and the preamp
connections to the columns are also shown.

Figure 2.     DRC effects in CID operation.

The total resistances are evaluated as 8.67 kΩ for the column and 21.6 kΩ for the row.
The RC products are 0.2 μs and 0.3 μs for the column and row respectively.     Further
detailed structural parameters, such as gate oxide thicknesses and chromium film thicknesses,
were evaluated and found to be in good agreement with the processing data.
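The quoted RC products follow directly from the measured totals, as the short check below shows. The dictionary layout and function name are illustrative only; the resistance and capacitance values are those given in the text.

```python
# Total electrode resistance (ohms) and capacitance (farads) from the text.
ELECTRODES = {
    "column": (8.67e3, 19e-12),
    "row": (21.6e3, 12e-12),
}


def rc_product_us(name):
    """RC product of an electrode, in microseconds."""
    resistance, capacitance = ELECTRODES[name]
    return resistance * capacitance * 1e6


for name in ELECTRODES:
    print(f"{name}: RC = {rc_product_us(name):.2f} us")
```

The products come out near 0.16 μs for the column and 0.26 μs for the row, which round to the 0.2 μs and 0.3 μs quoted above.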

III.  Device  operation  and  simulation  model 

As indicated in Figure 1, the rows and columns are connected to the scanning clock
drivers and preamplifiers respectively.     The device operation can be divided into three
steps.     During the first step, the photon-generated charges are integrated and stored in
the column site of the pixel.     The integration time is much longer than the RC time
constant of the row or column.     Hence, no significant nonuniformity is introduced in
this step.     During the second step, all the columns are floated simultaneously and the
photo-charges are transferred into the row sites by applying a voltage swing at the
appropriate row.     During the third step, the induced charges due to this transfer start to
redistribute over the entire column, and the preamp is read out at the sampling time
(defined by the duration from the onset of the row voltage swing to the readout).     The column
biases are then reset for the charge integration of the next cycle.     If the array is
operated at sufficiently high frame rates such that the row drive and induced column voltages
do not reach steady-state values, then a spatial pattern nonuniformity can result from two
DRC effects, as depicted in Figure 2.     The first DRC effect causes varying transferring
voltages along the row, as presented in Appendix A.    The second DRC effect results in varying
output level with pixel position along a column because of the different distances for the induced
charge to travel to reach the preamplifier.     The second effect is analyzed in Appendix B.

For the simulation, it is assumed that the transferred charge is linearly proportional
to the transferring voltage at the time t = (1/2) RC (row) after applying the voltage swing.
The photo-charges are 100% transferred when the transferring voltage is equal to the voltage
drive.     The charge transfer efficiency as a function of the pixel position along the row is
derived in Appendix A.     The results for time, t, equal to 1/2 the RC product for the row are
shown in Figure 3.

For the simulation, it is also assumed that the charges at the pixel redistribute at
t = (1/2) RC (row), and then readout at the column preamp occurs at t = (1/2) RC (row) + (1/2) RC
(column).     RC (row) and RC (column) are the RC products of the row and column, respectively.
The output, normalized by the steady-state readout, is obtained as a function of pixel
position along the column in Appendix B.     The specific data for t = (1/2) RC (column) is replotted
in Figure 3 for the column effect.

In order to simplify the analysis, the nonuniformity patterns were simulated for a 12x12
array.     The normalized output of each pixel is obtained as the product of both ordinates of
Figure 3 by selecting the proper pixel position.     With the column sites sequenced in each
row as the abscissa, such an output nonuniformity is presented in Figure 4.     Because of
symmetry about the center of the array, only the data for rows 1 through 6 are plotted in
this figure.     Figure 4 is complicated by the odd-even connections to the driver/preamp at
opposite ends of the row/column electrodes.     Since one row is addressed at a time, the
unequal distances of the odd and even column sites to the preamp result in the two curves in
Figure 4.     The upper curve corresponds to the columns with the shorter distance from the
preamp to the driven row.     Because the odd and even rows are connected to the driver at
opposite ends of the electrodes (see Figure 1), a type of reflection symmetry with respect
to the neighboring row is obtained as shown in Figure 4.

Figure 3.      Signal transfer functions (defined by the ratio of the quantity with DRC to
that of the ideal case without DRC).  Axis label: pixel position from column preamp.

Figure 4.     Simulated output as a result of the combined row and column DRC effects for a
12 x 12 array.  Mirror-image results are obtained for rows 7 through 12.


IV.     Nonuniformity patterns


For simplicity in the simulation, only two levels of gray have been considered, black
and white.     Thus, the simulated nonuniformity pattern is composed of white and black pixels
depending on whether the pixel output is greater or less than a selected display threshold
level.     By varying this level from 1.0 to 0.6 in Figure 4, five simulated patterns were
obtained, as shown in Figure 5.     The DRC effect of the electrode is shown by the fact that
the white pixels always show up near the connected ends of the row or column.     The odd-even
effect due to the driver/preamp odd-even connection forms a pattern which is symmetrical
about the center.     Three types of sub-patterns are also evident: namely, the "dotted"
sub-patterns, the "barred" sub-patterns, and the "plain" sub-pattern.

The "plain" sub-pattern is located at the center area.     As the threshold level decreases,
this area changes from a full dark area to a full bright area.     The "barred" sub-patterns,
with characteristic stripes of continuous white or black pixels, are located at the side
areas between corners.     The "dotted" sub-patterns are located at the four corners.  These
sub-patterns are characterized by isolated white pixels at high threshold, and isolated
black pixels at low threshold.     This shows that the brightest and darkest pixels both
occur at the corners of the array.
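The thresholded patterns described above can be sketched numerically. The code below is an illustrative reconstruction under stated assumptions, not the authors' program: it uses the image-series solutions of Appendices A and B evaluated at half an RC product, places pixel centres at (i + 0.5)/N of the electrode length, and assumes that even-indexed rows and columns are connected at one end and odd-indexed ones at the other. All names are hypothetical.

```python
import math

N = 12        # simulated array size
TAU = 0.5     # sampling time in units of RC (assumed for row and column)
TERMS = 20    # image terms kept in each series


def row_transfer(x, tau=TAU):
    """Appendix A solution: normalized voltage at x = X/L along a driven row."""
    s = 2.0 * math.sqrt(tau)
    return sum((-1) ** n * (math.erfc((2 * n + x) / s)
                            + math.erfc((2 * n + 2 - x) / s))
               for n in range(TERMS))


def column_transfer(p, tau=TAU):
    """Appendix B solution at x = 0, normalized by the steady-state reading."""
    total = sum(2.0 * math.exp(-(p + 2 * n) ** 2 / (4.0 * tau))
                for n in range(-TERMS, TERMS + 1))
    return total / (2.0 * math.sqrt(math.pi * tau))


def position(index, from_low_end):
    """Distance of a pixel centre from the connected end, in units of L."""
    d = (index + 0.5) / N
    return d if from_low_end else 1.0 - d


def simulate():
    """Normalized output of every pixel as the product of both effects."""
    out = []
    for r in range(N):
        row = []
        for c in range(N):
            x = position(c, from_low_end=(r % 2 == 0))  # distance to row driver
            p = position(r, from_low_end=(c % 2 == 0))  # distance to column preamp
            row.append(row_transfer(x) * column_transfer(p))
        out.append(row)
    return out


def threshold_pattern(output, level):
    """White ('#') where the output exceeds the display threshold."""
    return ["".join("#" if v > level else "." for v in row) for row in output]


output = simulate()
for line in threshold_pattern(output, 0.8):
    print(line)
```

Lowering the `level` argument from 1.0 toward 0.6 sweeps the pattern from mostly dark to mostly bright, and the alternating connections make the result symmetric about the array center.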

From a previously recorded video tape, it was possible to obtain three experimental
nonuniformity patterns on an InSb 32x32 CID array.     These patterns, shown in Figure 6, were
obtained by exposing the 32 x 32 imaging array to a uniform illumination and subsequently
lowering the display threshold level.     As can be seen, many of the nonuniformity patterns
discussed in the simulation results are obvious in the experimental results.     Bright edges,
odd-even patterns, and "plain", "dotted", and "barred" sub-patterns are evident in the
experimental displays.     The symmetry about the center is also observable in these figures.
It is thus concluded that the experimental pattern nonuniformities of this array were
significantly influenced by the DRC effect.

Figure 5.  Simulated nonuniformity patterns of a 12x12 array by selecting various display
threshold levels from Figure 4.

Figure 6.     Experimental nonuniformity patterns obtained from a 32 x 32 CID imaging array by
decreasing the display threshold from (a) to (c).


V.  Conclusions 


The  pattern  nonuniformities  due  to  the  DRC  effects  of  CID  arrays  have  been  analyzed  and 
are  found  to  be  significant  from  the  experimental  data  on  early  array  designs.     Since  the 
maximum  nonuniformity  and  average  output  degradation  due  to  the  DRC  effect  can  be  obtained 
from  data  such  as  Figure  4,   the  limitations  of  DRC  effects  on  sampling  readout  speed  and 
pixel  size  can  be  easily  derived.     Based  on  the  understanding  of  the  physical  origin,  the 
DRC  nonuniformities  have  been  successfully  eliminated  in  the  recent  revised  designs  of 
large  CID  arrays. 
Appendix A

Transient analysis of a voltage step
applied at one end of the row DRC network

By applying current continuity and Ohm's law to a DRC network [8], the differential equation

$$
\frac{\partial^2 V}{\partial x^2} = r_s c \, \frac{\partial V}{\partial t} \tag{A1}
$$

is obtained.  $r_s$ is the sheet resistance in ohms per square, and c is the distributed
capacitance per unit area.     With the boundary conditions

$$
V(x, t = 0^-) = 0 \tag{A2}
$$

$$
V(x = 0, t = 0^+) = V_{DD} \tag{A3}
$$

and

$$
\frac{\partial V}{\partial x}(x = L, t) = 0 \tag{A4}
$$

the mathematical equations are analogous to the solid diffusion problem in a predeposition
process with a constant source at one side of a semiconductor slab.     The solution thus can
be expressed in terms of the complementary error function.     In order to satisfy $\partial V / \partial x = 0$ at
x = L, reflection components at x = L are added.    To satisfy $V(0, t) = V_{DD}$, the reflected
component at x = 0 is subtracted.     The resultant solution with convergent components

$$
\frac{V(x, t)}{V_{DD}} = \sum_{n=0}^{\infty} (-1)^n \left[
\operatorname{erfc}\!\left( \frac{2n + x/L}{2\sqrt{t/RC}} \right)
+ \operatorname{erfc}\!\left( \frac{2n + 2 - x/L}{2\sqrt{t/RC}} \right) \right] \tag{A5}
$$

is obtained.  This solution is plotted as a function of position x/L at various times t/RC
in Figure A1 and is replotted as a function of t/RC at various x/L in Figure A2.
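A brief numerical evaluation of the image-series solution may help in reading the curves. The sketch below assumes the standard form of the series for a step applied at x = 0 with a zero-gradient condition at x = L, truncated to a fixed number of terms; the function name and the chosen sample points are illustrative only.

```python
import math


def row_voltage(x, tau, terms=50):
    """Normalized voltage V/VDD at position x = X/L and time tau = t/RC."""
    s = 2.0 * math.sqrt(tau)
    return sum((-1) ** n * (math.erfc((2 * n + x) / s)
                            + math.erfc((2 * n + 2 - x) / s))
               for n in range(terms))


# Voltage profile along the electrode at several normalized times.
for tau in (0.1, 0.5, 1.0, 2.0):
    profile = [row_voltage(i / 4, tau) for i in range(5)]
    print(f"t/RC = {tau:3.1f}: " + "  ".join(f"{v:.3f}" for v in profile))
```

The driven end stays at the full step voltage, while the far end reaches only about 63% of it after half an RC product and approaches the full value after a few RC products.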



SPIE  Vol.  341  Real  Time  Signal  Processing  V  (1982)  /  227 


Figure A1.  Voltage normalized by the voltage step, expressed as a function of position.

Figure A2.  Voltage normalized by the voltage step, replotted as a function of time.
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Appendix  B 


Transient  analysis  of  the  charge 
transferred  from  a  pixel  of  the  column  DRC  network 


When  the  charge    A  Q  is  transferred  out  from  a  column  site  at  x  =  p,   that  voltage  change 
AV    (=   AQ/  A  C    where   AC    is  the  per-pixel  capacitance)    is  induced  at  that  pixel.  This 
voltage  change  will  redistribute  over  the  entire  floating  electrode.     By  applying  the 
boundary  conditions  of  the  column  voltage 

Vc    (x,   t  =  o~)    =   AV    Ap  6  (x  -  p)  (Bl) 


and 


V     (x,   t  =   co)    =    ..  u  v  (B2) 
c  p 


(x  =  o,    t)    =  0  (B3) 


dx 


6Vc 

— (x  =  L,   t)    =0  (B4) 

o  x 

to  equation  Al  with  the  pixel  size   Ap    assumed  much  smaller  than  the  electrode  length  L 
and  P  defined  as  the  number  of  pixels  per  volume;   the  problem  is  analogous  to  the  diffusion 
of  a  localized  impurity  delta-function.     In  order  to  satisfy   (B3)    and   (B4) ,   reflection  com- 
ponents are  added  to  end  up  with 

Vc    (x,t)  1  »  -^(^P  +  2n)     -JC(x+J>+2n)|  (B5) 


AV/P 


2  J  n  t/RC 


(e  +e  ) 


This equation is plotted in Figure B1 for the output reading at x = 0 versus normalized sampling time, with charge transferred from five different positions of the column electrode.
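As an illustration only, the reflection (image) series of (B5) can be evaluated numerically. The sketch below assumes the standard method-of-images form for a delta-function source between two reflecting ends, with x and p expressed as fractions of the electrode length L and time expressed as t/RC. The function name `column_voltage` and the truncation parameter `n_terms` are hypothetical and do not come from the paper.

```python
import math


def column_voltage(x, p, tau, n_terms=50):
    """Column voltage normalized by the steady-state value (delta-V / P).

    x, p    : read-out and injection positions as fractions of L (0..1)
    tau     : normalized time t/RC (must be positive)
    n_terms : number of image pairs kept on each side (assumed truncation)
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    prefactor = 1.0 / (2.0 * math.sqrt(math.pi * tau))
    total = 0.0
    for n in range(-n_terms, n_terms + 1):
        # Direct term and its mirror image about x = 0, shifted by 2n.
        total += math.exp(-((x - p + 2 * n) ** 2) / (4.0 * tau))
        total += math.exp(-((x + p + 2 * n) ** 2) / (4.0 * tau))
    return prefactor * total


if __name__ == "__main__":
    # Reading at x = 0 for charge injected at five assumed positions.
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(p, [round(column_voltage(0.0, p, tau), 4) for tau in (0.05, 0.2, 1.0)])
```

At long times the normalized reading approaches 1 regardless of the injection position, while charge injected far from the read-out end arrives later.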
Figure B1. Column voltage signal read at x = 0, normalized by the steady-state reading and expressed as a function of time, with the charge transferred from five different positions.

Abstract

This paper describes a new method of surface potential measurement for MIS infrared focal plane arrays. The key feature of this method is a charge sensitive amplifier which detects the surface potential directly. The surface potential depends on the photo-generated charge carriers stored in a potential well as well as on the gate voltage. Therefore, this measurement can be used for both electronic and optical characterization of an MIS infrared imager such as an infrared charge coupled device (IRCCD) or an infrared charge injection device (IRCID). Mercury cadmium telluride (HgCdTe) IRCIDs with 3x5 pixels were evaluated using this technique. The measurement was controlled by an HP System 35 and proved more accurate, more informative, and faster than the conventional capacitance-voltage (C-V) measurement.

Introduction 

Charge transfer devices (CTDs) are a promising approach to staring infrared focal plane arrays. IRCTDs combine a self-scanning function with an infrared detecting function. These functions are based on a simple MIS structure formed on highly sensitive infrared material, where photo-generated charge carriers are stored in a potential well at the insulator-semiconductor interface under the metal electrode. The charge carriers can be transferred between two adjacent potential wells by applying voltage to each electrode. The charge handling capability of the CTDs is limited by the size of the potential well and the interrelationship between adjacent potential wells. The size of the potential well can be measured by the surface potential, φs, of the empty well. Therefore, the relationship between the applied gate voltage, Vg, and the surface potential is the most important factor for device operation and performance.

Despite the great importance of the φs-Vg characteristic, the surface potential of the IRCTD has never been measured. This is because it is difficult to apply the conventional φs-Vg measurement, which uses on-chip field effect transistors (FETs) and is commonly used for silicon MIS devices.* At present, FETs cannot be easily fabricated on the infrared material. The IRCTDs reported so far have been evaluated using the C-V measurement. The C-V measurement usually requires extra devices (i.e. large MIS diodes) on a chip to achieve accurate evaluation. Otherwise, this measurement yields rather inaccurate results because of a relatively large parasitic capacitance caused by an overlapping electrode structure and the need for a cryogenic assembly.

This paper proposes a new method of φs-Vg measurement. This method features a charge sensitive amplifier and requires no extra device on a chip. It is also free from the influence of parasitic capacitance. The method was tested on HgCdTe IRCIDs with 3x5 pixels. It was also adapted for semi-automatic evaluation of the device using an HP System 35. The measurement can be carried out on IRCIDs in a normal operating environment. Electronic and optical characteristics of the device can be derived from the measured φs-Vg and φs-time characteristics. Operating conditions such as applied voltage levels and integration time can be determined from these characteristics.

Basic  theory 

Using a charge sensitive amplifier is an old concept. It was successfully applied in a split-electrode CCD filter.5 It can also be used as a charge readout circuit for a CID.6 The proposed φs-Vg measurement is based on this concept.

Charge  sensitive  amplifier 

A charge sensitive amplifier consists of an operational amplifier and a resettable feedback capacitor, Cf, as shown in Fig. 1. A sensing capacitor, Cs, is connected to the negative input terminal. The positive input terminal is kept at the ground potential. When a reset pulse, φR, is applied to a reset FET, the feedback capacitor is discharged and the output voltage, Vout, is reset to 0 V. After the reset FET is cut off, the sensing line is kept floating, so that a voltage change, E, applied to Cs causes a potential change on the sensing line. The operational amplifier keeps both input terminals at the same potential by driving a charge through the feedback capacitor. The output voltage developed across the feedback capacitor is derived from a charge equation which assumes the total charge on the sensing line to be constant.


Figure  1.     Charge  sensitive  amplifier. 


$$V_{out} = \frac{C_s E}{C_f \, (1 - 1/A)} \qquad (1)$$

where A is the open-loop gain of the operational amplifier. The open-loop gain is usually so large that the 1/A term can be ignored. The parasitic capacitance, Cp, shown in the figure has no influence on the output voltage unless the voltage applied to Cp changes.
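To give a feel for how small the finite-gain correction in Eq. 1 is, the following sketch evaluates the expression as it is printed, without regard to sign convention. The function name `charge_amp_output` and the component values in the example are hypothetical; they are chosen for illustration and are not taken from the paper.

```python
import math


def charge_amp_output(c_s, c_f, e, open_loop_gain=math.inf):
    """Output voltage of a charge sensitive amplifier per Eq. 1 (as printed).

    c_s, c_f       : sensing and feedback capacitances in farads
    e              : voltage change applied to the sensing capacitor in volts
    open_loop_gain : op-amp open-loop gain A (infinite means an ideal amplifier)
    """
    if c_f <= 0 or open_loop_gain <= 1:
        raise ValueError("c_f must be positive and open_loop_gain must exceed 1")
    return c_s * e / (c_f * (1.0 - 1.0 / open_loop_gain))


if __name__ == "__main__":
    ideal = charge_amp_output(1e-12, 2e-12, 0.5)            # assumed values
    real = charge_amp_output(1e-12, 2e-12, 0.5, 1e5)        # assumed gain
    print(ideal, real, (real - ideal) / ideal)
```

With a gain of 10^5 the relative correction is about 10^-5, which is consistent with the statement that the 1/A term can be ignored.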


Figure 2. Setup for φs-Vg measurement.

Figure 3. Output waveform from a measuring circuit. H: 0.5 ms/div. V: 0.2 V/div. Vsub = 12 - 18 V.


Surface  potential  measurement 


Figure 2 shows the circuit for the surface potential measurement with a schematic cross-section of an IRCID cell. The cell has an overlapping multi-level-electrode structure. A field plate (FP) electrode defines an active area, and column/opaque and row/semitransparent electrodes (EC and ER, respectively) cover this area. The electrode being measured (the EC electrode in this case) is connected to the charge sensitive amplifier through a switching transistor, FET1, for "charge integration", or is held at the injection voltage, Vinj, through a switching transistor, FET2, for "charge injection". The IRCID is biased from its substrate in place of the gate voltage, Vg, where the substrate voltage, Vsub, is equivalent to -Vg.


At first, the electrode is held at Vinj through FET2 and separated from the charge sensitive amplifier by FET1. The injection voltage is chosen to cancel (Vsub + VFB) so that the semiconductor surface is accumulated, where VFB represents the flatband voltage of the device. At the same time, the charge sensitive amplifier is reset. Then, FET2 is cut off and the electrode is separated from Vinj. FET1 turns on so that the electrode connects to the charge sensitive amplifier and is discharged through the reset FET. As a result, an empty potential well corresponding to the substrate voltage is formed under the electrode. A little later, the reset FET turns off and the charge sensitive amplifier begins to integrate a charge.

A room temperature environment generates the proper charge for the measurement. The photo-generated charge carriers are stored in the potential well, lowering its surface potential. This displacement of the surface potential is detected by the charge sensitive amplifier through the gate insulator capacitance, Cins, of the electrode. The output waveform shown in Fig. 3 gives the surface potential versus time (φs-t) characteristic of the ER electrode under 300K background irradiance. A potential well under the EC electrode is also filled with charge carriers in a few milliseconds because of infrared irradiance through the adjacent ER electrode. Since the surface potential of the filled well can be regarded as zero, the total displacement from the initial potential gives the surface potential of the original empty well. Thus, the output voltage can be written as follows.


$$V_{out} = \frac{C_{ins}}{C_f} \, \phi_s \qquad (2)$$


At  the  end  of  the  integration  period,   the  electrode  is  connected  to  the  injection  voltage 
again  so  that  stored  charge  carriers  are  injected  into  the  substrate  to  recombine  with  the 
majority  carriers.     In  this  measurement,   a  parasitic  capacitance  has  no  influence  on  the 
output  voltage  as  long  as  the  other  electrodes  are  held  at  constant  voltage  as  described  in 
the  previous  section. 

Insulator  capacitance  measurement 

The insulator capacitance of the electrode is needed to evaluate the surface potential using Eq. 2. The insulator capacitance can easily be measured using the same circuit by superimposing a reference pulse, φref, on the substrate voltage as shown in Fig. 2. The semiconductor surface is kept at the substrate potential as long as the surface is accumulated. In this case, the MIS structure can be regarded as a simple capacitor of Cins. The voltage swing of φref is also detected by the charge sensitive amplifier to give the reference output voltage, Vref. The reference output is related to φref by Eq. 2, so that Cins can be written as follows.


$$C_{ins} = C_f \, \frac{V_{ref}}{\phi_{ref}} \qquad (3)$$

where φref and Cf are given.
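The two relations above can be chained: the reference pulse calibrates Cins, and the calibrated Cins converts a measured output voltage into a surface potential. The following is a minimal sketch under the assumption that Eq. 2 has the form Vout = (Cins/Cf) φs and Eq. 3 the form Cins = Cf Vref/φref. The function names and the numerical values are hypothetical.

```python
def insulator_capacitance(v_ref, phi_ref, c_f):
    """Eq. 3: insulator capacitance from the reference output (farads)."""
    if phi_ref == 0:
        raise ValueError("reference pulse amplitude must be non-zero")
    return c_f * v_ref / phi_ref


def surface_potential(v_out, c_ins, c_f):
    """Eq. 2 solved for the surface potential (volts)."""
    if c_ins <= 0:
        raise ValueError("c_ins must be positive")
    return v_out * c_f / c_ins


if __name__ == "__main__":
    c_f = 2e-12                                            # assumed feedback capacitor
    c_ins = insulator_capacitance(0.4, 1.0, c_f)           # assumed 1 V reference pulse
    print(c_ins, surface_potential(0.2, c_ins, c_f))
```

Because both steps use the same amplifier and the same Cf, the surface potential reduces to Vout φref / Vref, so an error in the assumed value of Cf cancels out.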


Measuring  system  description 


The surface potential should be measured repeatedly, changing the substrate voltage little by little, to obtain a complete φs-Vg characteristic. This measurement should be carried out for each electrode to evaluate an IRCID. For this reason, a measuring system was built around an HP System 35 to evaluate IRCIDs swiftly.


Measuring  system 

Figure 4 shows a block diagram of the measuring setup and its interconnections. [...]
system is essentially semi-automatic and requires operator interactions, [...]
programs the pulse generator via address buffer and controls the DC power supply
connected to the IRCID's substrate. The output from the measuring circuit [...]
and read by the HP System 35 via 12 bit analog to digital (A/D) converter, [...]
processed and written to a floppy disc. Processed data is displayed and printed
in an appropriate format. The IRCID is tested at cryogenic temperatures in [...]
with a germanium window. The room temperature environment takes the place [...]
source. A controlled infrared source should be used for accurate optical [...]
Figure 4. Block diagram of a measuring system.

Each electrode of the IRCID is connected to the measuring circuit via a 1/64 terminal selector, which is adaptable to IRCIDs with up to 32 x 32 pixels. The output waveform is always monitored on a scope.

Measuring  procedure 

Before  measuring,  the  first  electrode  is  tested  step  by  step  from  the  keyboard  according 
to  the  programmed  "initial  measuring  mode".     The  operator  has  to  determine  the  desired 
integration  period  and  required  substrate  voltage  range  from  the  displayed  data  and  output 
waveform.     The  integration  period  should  exceed  maximum  storage  time  of  the  electrode,  but 
not  by  much.     An  insufficient  integration  period  causes  an  error  in  measured  surface 
potential.     Too  long  an  integration  period  picks  up  hum  noise.     The  substrate  voltage  range 
should  correspond  to  the  surface  potential  variation.     Excess  substrate  voltage  might  cause 
the  device  to  break  down.     Since  EC  and  ER  electrodes  of  an  HgCdTe  IRCID  have  a  different 
storage  time  and  a  different  substrate  voltage  range   (i.e.   flatband  voltage),   the  operator 
has  to  set  the  mode  twice  in  the  measurement.     Some  mode  parameters  with  their  typical 
values,  which  can  be  set  from  the  keyboard,   are  listed  below. 


1.  Integration  period,  2  -  5  ms 

2.  Substrate  voltage  range,  5  -  20  V 

3.  Voltage  increment,  0.5  V 

4.  Step  interval,  0.5  s 


Once the measuring mode is set, the insulator capacitance is measured at a substrate voltage of 0 V, which usually satisfies the accumulation condition for HgCdTe MIS devices. Then, the supplied voltage moves to its minimum, gradually steps up to its maximum, and returns to 0 V. At each step of the substrate voltage, the output from the measuring circuit is read from the A/D converter. A step interval of 0.5 seconds is sufficient for the operator to deal with an unexpected situation such as a device breakdown. After the φs-Vg measurement, some initial data below the flatband voltage are averaged to give the offset voltage of the measuring circuit. All data are corrected using the offset voltage and insulator capacitance, and are written to the floppy disc. The processed data are displayed and may be printed. Then, the next electrode is selected and measured.
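The sweep and correction steps described above can be sketched as follows. This is an illustrative outline only, not the HP System 35 program: the function names, the data layout (a list of substrate-voltage and output-voltage pairs), and the use of substrate-voltage magnitude to decide which points lie below flatband are all assumptions.

```python
def sweep_voltages(v_min, v_max, step):
    """Substrate voltages from v_min to v_max inclusive, in equal increments."""
    if step <= 0 or v_max < v_min:
        raise ValueError("need step > 0 and v_max >= v_min")
    n_steps = int(round((v_max - v_min) / step))
    return [v_min + i * step for i in range(n_steps + 1)]


def correct_readings(readings, v_fb, c_ins, c_f):
    """Remove the circuit offset and convert output voltage to surface potential.

    readings : list of (v_sub, v_out) pairs from one sweep
    v_fb     : flatband voltage, on the same axis as v_sub (assumed convention)
    The offset is the mean output of all points below flatband, where the
    surface potential is expected to be zero.
    """
    below = [v_out for v_sub, v_out in readings if v_sub < v_fb]
    if not below:
        raise ValueError("no readings below flatband to estimate the offset")
    offset = sum(below) / len(below)
    return [(v_sub, (v_out - offset) * c_f / c_ins) for v_sub, v_out in readings]


if __name__ == "__main__":
    print(len(sweep_voltages(5.0, 20.0, 0.5)))
    demo = [(5.0, 0.01), (6.0, 0.01), (8.0, 0.21)]          # assumed readings
    print(correct_readings(demo, v_fb=7.0, c_ins=1e-12, c_f=2e-12))
```

With the typical mode parameters listed above (5 to 20 V in 0.5 V increments), one sweep consists of 31 substrate-voltage steps.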

In the φs-Vg measurement, the sample and hold (S/H) timing is fixed near the end of the integration period. If needed, the output waveform can be read by shifting the S/H timing. This measurement is usually carried out at the maximum usable voltage to evaluate the maximum storage time for the ER electrode.


Results  and  discussion 


The φs-Vg measurement was carried out on HgCdTe IRCIDs. For comparison, the C-V characteristics were also measured using MIS diodes formed on the same chip. The measured φs-Vg characteristics were examined and confirmed to be useful for device characterization and optimization of operating conditions. The φs-t characteristic was also confirmed to be useful for optical characterization of the device.

Device  description 


The IRCIDs were fabricated on n-type HgCdTe substrate using multi-level-electrode MIS technology. The device has an overlapping electrode structure with three level electrodes and four level zinc sulphide (ZnS) layers formed on the native oxide. A photomicrograph of a chip is shown in Fig. 5. The chip contains an IRCID with 3x5 pixels, a unit IRCID cell, and four test MIS diodes. The IRCID's cells are located in a 100 µm space. The unit IRCID cell is about ten times larger than the actual IRCID's cell and each test MIS diode has an area of 200 x 500 µm².

Figure 5. HgCdTe CID chip with a 3x5 CID, unit cell, and test MIS diodes.

Figure 6. C-V characteristics measured on HgCdTe chip. Horizontal axis: gate voltage Vg (V), from -18 to -4. Curves: TEST EC (20 pF), TEST ER (13 pF), EC, and ER (6.4 pF).

C-V  measurement  and  its  limitation 

The C-V characteristics of the EC and ER level electrodes are shown in Fig. 6. These were measured using test MIS diodes in a cold background condition. Although these MIS diodes were carefully designed to have a minimum of overlapping area, the parasitic capacitance is not negligible. The C-V characteristics of the EC and ER electrodes measured on the IRCID are also shown in the figure. In the IRCID, each electrode has a much smaller capacitance than the test MIS diodes and shows only a slight change in the C-V curve because of a relatively large parasitic capacitance. A typical electrode capacitance is 6 pF, while the effective MIS capacitance of the electrode is less than 1 pF.

Theoretically, it is possible to derive device parameters such as flatband voltage, insulator capacitance, and bulk concentration from the C-V measurement. Because of the relatively large parasitic capacitance, however, such an evaluation does not give practical results for the actual IRCID's electrodes. The C-V measurement gives little information about device operation and performance.

φs-Vg characteristic and optimization of device operation

Figure 7 shows the output format of the φs-Vsub characteristic. This is equivalent to the usual φs-Vg characteristic. It should be noted that this characteristic was measured on the actual IRCID's EC electrode. The flatband voltage of the electrode can easily be determined from the point where the φs-Vg characteristic starts to rise. The flatband voltage coincides with the value evaluated from the C-V characteristic shown in Fig. 6. The φs-Vg characteristic has a maximum surface potential, φsmax, and falls at higher voltage. This shows the tunneling limit, where the formation of an empty potential well is limited by tunnel current.

Figure 8. (b) Surface potential diagrams for a parallel injection readout with a signal charge.

The φs-Vg characteristics can be used to determine optimum operating voltage levels for an IRCID. Figure 8 shows schematic φs-Vg curves. Surface potential diagrams for parallel injection readout are also shown in the figure. In the parallel injection readout, photo-generated charge carriers are stored under an ER electrode and then transferred to under an EC electrode for signal readout as shown in Fig. 8(b). From the φs-Vg characteristic and the insulator capacitance, the maximum storage charge, Qmax, stored under the ER electrode can be evaluated as follows.


$$Q_{max} = C_{ER} (V_{max} - V_{FB}) \qquad (4)$$

where CER and VFB represent the insulator capacitance and flatband voltage of the ER electrode, respectively. The available maximum gate voltage is Vmax, which corresponds to the maximum surface potential, φsmax. The actual charge handling capability of the cell is limited by the interrelationship with the adjacent potential well of φEC as shown in Fig. 8(b), where φEC represents the surface potential under the EC electrode. The signal charge, Qs, is given by

$$Q_s = C_{ER} (V_{max} - V_{ER}) \qquad (5)$$

where VER is the gate voltage which forms a potential well of φEC under the ER electrode. The signal charge is transferred to under the EC electrode and stored there. This lowers its surface potential to φOP. Qs can also be written as follows.

$$Q_s = C_{EC} (V_{EC} - V_{OP}) \qquad (6)$$

where CEC is the insulator capacitance and VEC is the biasing level of the EC electrode. VOP is the operating point voltage which corresponds to φOP.

In infrared imaging, the signal charge consists of the 300K background charge and the real signal charge, qs, from the object. Since the video output is proportional to the surface potential change caused by qs, the video output depends on the slope of the φs-Vg curve as well as on the signal charge. When the signal charge (i.e. qs) increases, the operating point, VOP, is lowered, and the slope at VOP decreases because of an increased depletion capacitance. Therefore, there is an optimum operating point which gives the maximum video output. For a given VEC, Qs can be calculated using Eq. 5, where VER is determined from the φs-Vg characteristics. Then, VOP can be evaluated using Eq. 6 for the calculated Qs. The slope at VOP can be read from the φs-Vg characteristic. The video output can be estimated by multiplying the slope by qs, which is proportional to Qs. This procedure makes it possible to determine the optimum operating voltage levels and charge handling capability of the device.
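The procedure in this paragraph (Eq. 5, then Eq. 6, then a slope read from the measured curve) can be outlined in a few lines. The sketch below is illustrative: the function names are hypothetical, voltages are treated as magnitudes, the measured φs-Vg characteristic is assumed to be available as a sorted table of points, and the slope is taken from the straight segment that contains VOP.

```python
def signal_charge(c_er, v_max, v_er):
    """Eq. 5: signal charge stored under the ER electrode (coulombs)."""
    return c_er * (v_max - v_er)


def operating_point(c_ec, v_ec, q_s):
    """Eq. 6 solved for the operating point voltage V_OP."""
    if c_ec <= 0:
        raise ValueError("c_ec must be positive")
    return v_ec - q_s / c_ec


def slope_at(curve, v):
    """Slope of a tabulated phi_s-Vg curve at voltage v.

    curve : list of (v_g, phi_s) pairs sorted by increasing v_g
    Uses the piecewise-linear segment that contains v.
    """
    for (v0, p0), (v1, p1) in zip(curve, curve[1:]):
        if v0 <= v <= v1:
            return (p1 - p0) / (v1 - v0)
    raise ValueError("v lies outside the tabulated curve")


if __name__ == "__main__":
    curve = [(5.0, 0.0), (10.0, 2.0), (15.0, 3.0)]          # assumed sample points
    q_s = signal_charge(1e-12, 15.0, 10.0)                  # assumed C_ER, V_max, V_ER
    v_op = operating_point(1e-12, 12.0, q_s)                # assumed C_EC, V_EC
    print(q_s, v_op, slope_at(curve, v_op))
```

Repeating the calculation over a range of VEC values and comparing the product of slope and signal charge would locate the operating point that gives the largest estimated video output.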

φs-t characteristic

As shown in Fig. 3, the output waveform gives the φs-t characteristic of the electrode. The maximum storage time under 300K background irradiance, tmax, can be evaluated from the φs-t characteristic corresponding to the maximum gate voltage. The 300K background photon irradiance, φB, on the focal plane can be calculated using the cutoff wavelength and field of view (FOV) of the device. The external quantum efficiency, η, can be estimated using the following equation.

$$Q_{max} = q \, \eta \, \phi_B \, t_{max} \, S \qquad (7)$$

where S represents the photoactive area of a cell.
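Solving Eq. 7 for the quantum efficiency is a one-line rearrangement. The sketch below shows it, assuming Eq. 7 reads Qmax = q η φB tmax S. The function name and the example values for irradiance, storage time, and area are hypothetical and only illustrate the units involved.

```python
ELECTRON_CHARGE = 1.602176634e-19  # coulombs


def quantum_efficiency(q_max, phi_b, t_max, area):
    """External quantum efficiency from Eq. 7.

    q_max : maximum storage charge in coulombs
    phi_b : background photon irradiance in photons / (cm^2 s)
    t_max : maximum storage time in seconds
    area  : photoactive area of a cell in cm^2
    """
    photons = phi_b * t_max * area
    if photons <= 0:
        raise ValueError("photon count must be positive")
    return q_max / (ELECTRON_CHARGE * photons)


if __name__ == "__main__":
    # Assumed values: 1e16 photons/(cm^2 s), 3 ms storage time, 1e-4 cm^2 cell.
    photons = 1e16 * 3e-3 * 1e-4
    q_max = 0.5 * photons * ELECTRON_CHARGE
    print(quantum_efficiency(q_max, 1e16, 3e-3, 1e-4))
```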


Figure 10. Maximum surface potential vs. bulk concentration. Horizontal axis: bulk concentration (cm^-3), 0.2 to 5 x 10^15.


The responsivity, R, of the device can also be estimated using this external quantum efficiency equation and the slope of the φs-Vg curve of the EC electrode. The video output for one photon of irradiance, that is, the responsivity, can be written as follows.

$$R = K \left( \frac{d\phi_s}{dV} \right)_{OP} \frac{q \, \eta}{C_{EC}} \qquad (8)$$


where the differential term represents the slope at the operating point, and K stands for
the readout gain. For a charge sensitive amplifier, this readout gain is given by CEC/Cf. The
responsivity of 1.8 x 10^-8 V/photon estimated using this method showed reasonable
agreement with the responsivity measured on the actual IRCID.

Comparison  with  theoretical  curves 

The reliability of the φs-Vg measurement was confirmed by comparing it with the well established theoretical equation given by 7

$$\phi_s = V_0 + (V_g - V_{FB}) - \left[ V_0^2 + 2 V_0 (V_g - V_{FB}) \right]^{1/2} \qquad (9)$$

where

$$V_0 = \frac{q \, \varepsilon_s N_d X_{ins}^2}{\varepsilon_{ins}^2 \, \varepsilon_0}$$

Solid curves in Fig. 9 show such calculations, where device parameters were determined from Hall measurement, fabrication conditions, and measured electronic constants. Native oxide thickness was calculated in terms of ZnS thickness. The measured surface potential agrees well with these curves at lower voltage.
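For readers who want to reproduce curves of this kind, Eq. 9 is straightforward to evaluate. The sketch below assumes the deep-depletion form φs = V0 + (Vg - VFB) - [V0² + 2V0(Vg - VFB)]^(1/2) with V0 = q εs Nd Xins² / (εins² ε0), where εs and εins are relative permittivities, and it treats all voltages as magnitudes. The function names and the material values in the example are hypothetical, not the parameters used in the paper.

```python
import math

ELECTRON_CHARGE = 1.602176634e-19   # coulombs
EPSILON_0 = 8.8541878128e-12        # farads per metre


def characteristic_voltage(eps_s, eps_ins, n_d, x_ins):
    """V0 in volts.

    eps_s, eps_ins : relative permittivities of semiconductor and insulator
    n_d            : bulk (donor) concentration in m^-3
    x_ins          : insulator thickness in metres
    """
    return ELECTRON_CHARGE * eps_s * n_d * x_ins ** 2 / (eps_ins ** 2 * EPSILON_0)


def deep_depletion_phi_s(v_g, v_fb, v_0):
    """Eq. 9: empty-well surface potential; zero at or below flatband."""
    dv = v_g - v_fb
    if dv <= 0:
        return 0.0
    return v_0 + dv - math.sqrt(v_0 ** 2 + 2.0 * v_0 * dv)


if __name__ == "__main__":
    # Assumed values: eps_s = 17, eps_ins = 8, Nd = 1e15 cm^-3, 0.3 um insulator.
    v_0 = characteristic_voltage(17.0, 8.0, 1e21, 0.3e-6)
    for v_g in (6.0, 8.0, 10.0, 12.0):
        print(v_g, round(deep_depletion_phi_s(v_g, 5.0, v_0), 3))
```

This expression does not include the tunneling limit, so it keeps rising with gate voltage and is only comparable with the measurement at lower voltage.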

The tunneling limit was simulated using W. W. Anderson's tunnel current.8 Simulated results are shown by broken lines in the figure. In this simulation, the energy gap is determined from the measured cutoff wavelength of the device. Simulated results are summarized in Fig. 10, where the variation of φsmax with bulk concentration is shown. The measured φsmax of various IRCIDs corresponds well to these curves unless there is a bulk defect which accelerates tunneling.

These analyses show that the φs-Vg characteristic can be used to evaluate device parameters such as flatband voltage, insulator capacitance (thickness), bulk concentration, and cutoff wavelength. The drawback of this measurement is that the measured surface potential is already averaged over the IRCID cells. Therefore, nonuniformity along the electrode and point defects cannot be found in this measurement. The conventional C-V measurement has the same limitation.


Measuring  system  performance 

At present, the speed of the measurement is limited by the measuring interval of 0.5 second. This is inevitable as long as operator interaction is assumed. From the hardware standpoint, the speed of the system is limited by the A/D converter, whose response time is nearly 100 ms. The measuring time could be improved to 5 s/electrode from the present 15 - 20 s/electrode. For such an improvement, software must be developed to deal with unexpected situations. A main clock frequency of 500 kHz makes it possible to fix pulse widths and positions every 2 µs using a programmable read only memory (PROM). The integration period can be independently controlled to vary from 0.5 ms to 7.5 ms in 0.5 ms units. Stepping and reading timing can be set independently because an S/H circuit isolates the measuring circuit from the rest of the system. This isolation eases the timing requirements of the system. The DC power supply provides a voltage of 0 to 50 V in 50 mV units. The 12 bit A/D converter assures voltage measurement in 1.5 mV units.
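The figures quoted in this paragraph are mutually consistent, as a short calculation shows. The sketch below is only a cross-check: the function names are hypothetical, the sweep range is the typical 5 to 20 V range listed in the measuring procedure, and the time estimate counts sweep steps only (it ignores the capacitance measurement and the return to 0 V).

```python
def clock_period_us(frequency_hz):
    """Timing resolution in microseconds for a given main clock frequency."""
    return 1e6 / frequency_hz


def sweep_time_s(v_min, v_max, step, interval_s):
    """Time for one substrate-voltage sweep, one reading per step."""
    n_points = int(round((v_max - v_min) / step)) + 1
    return n_points * interval_s


def adc_full_scale_v(bits, lsb_v):
    """Input span covered by an A/D converter with the given resolution."""
    return (2 ** bits) * lsb_v


if __name__ == "__main__":
    print(clock_period_us(500e3))                       # timing grid of the PROM
    print(sweep_time_s(5.0, 20.0, 0.5, 0.5))            # operator-paced sweep
    print(sweep_time_s(5.0, 20.0, 0.5, 0.1))            # paced by a 100 ms converter
    print(adc_full_scale_v(12, 1.5e-3))                 # span implied by 1.5 mV units
    print([0.5 * k for k in range(1, 16)])              # integration periods in ms
```

A 31-step sweep at 0.5 s per step takes 15.5 s, in line with the quoted 15 - 20 s/electrode, and the same sweep paced by a 100 ms converter would take about 3 s, which leaves margin within the projected 5 s/electrode.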

Since the number of electrodes is small, the system prints out all raw data in the form of a graph and/or table as shown in Fig. 7. The measured data can be analyzed to disclose more useful information for device improvement and optimization of the IRCID operation. These analyzing programs are still under development. Because of the simple architecture, the system can easily be adapted to larger devices by enlarging its terminal selector.

Summary 

A new method of surface potential measurement is proposed. This method features a charge sensitive amplifier, requires no extra device on a chip, and is free from the influence of parasitic capacitance. A φs-Vg measuring system has been built around an HP System 35 to evaluate monolithic HgCdTe IRCID arrays. The φs-Vg and φs-t characteristics of the device measured using this system agreed well with theoretical calculations. These characteristics proved useful for device characterization and optimization of operating conditions. Additional measurement, data analysis, and detailed comparison with actual device performance are now under way. Because of the simple architecture and the accurate, informative, and speedy performance of the system, this can be an efficient method to evaluate MIS infrared imaging arrays.
Abstract


The  chip  sets  and  brassboards  being  developed  in  Phase  I  of  the  VHSIC  Program  are 
described. 


Introduction 


The  Very  High  Speed  Integrated  Circuits  (VHSIC)  Program  is  developing  28  complex 
integrated circuits that are directly responsive to a wide spectrum of system requirements.
These chip designs will be completed by the end of 1983 and will subsequently be
produced in pilot quantities for application in functional brassboards. The chips being
made by each Phase I VHSIC contractor are listed in Table 1. While these chips are the
most tangible product of this phase of the VHSIC Program, it is important to note that the
chips  represent  a  complex  technology  involving  not  only  the  processes  by  which  the  chips 
are  fabricated  but  also  an  array  of  functional  design,  test,  simulation,  and  software 
techniques  that  did  not  exist  heretofore.  The  establishment  of  this  new  plateau  of  VLSI 
technology  for  defense  is  most  important  but  could  not  be  accomplished  without  the  more 
tangible  goal  of  fabricating  an  array  of  immediately  useful  chips.  These  chips  would 
themselves  be  abstractions  if  they  were  not  directly  related  to  specific  needs  and  if 
their  capabilities  were  not  to  be  demonstrated  in  brassboards  that  represent  important 
system  functions.  In  the  following  paragraphs,  the  chip  sets  and  brassboard  of  each  of 
the  Phase  I  contractors  are  described. 


Table 1.  VHSIC chip sets

Contractor      Chips
------------    ------------------------------------
Honeywell       Parallel programmable pipeline
                Controller

Hughes          Digital correlator
                Algebraic encoder/decoder
                Spread spectrum subsystem

IBM             Complex multiplier accumulator

TI              Gate array
                Gate array with memory
                Data processor
                Array controller and sequencer
                Vector address generator
                Vector arithmetic and logic unit
                Static RAM
                Multipath switch

TRW             Content addressable memory
                Window addressable memory
                Register arithmetic logic unit
                Address generator
                Matrix switch
                16-bit multiplier/accumulator
                Microcontroller
                Four-port memory

Westinghouse    Arithmetic unit
                Pipeline arithmetic unit
                Extended arithmetic unit
                Controller
                Gate array
                Static RAM


Honeywell 

Honeywell's brassboard is the electro-optical signal processor (EOSP). Two semicustom chips are required: a parallel programmable pipeline (PPP) chip and a controller chip. In the EOSP, two controller chips control 32 PPP chips to meet the high-throughput requirements of typical scene segmentation and target recognition algorithms, commonly several billion operations per second. Both chips predominantly utilize Integrated Schottky Logic (ISL) technology for high-density low-power logic (64,000 gates/cm², 30 µW/gate, 2.0-nsec delay) and Current Mode Logic (CML) technology for very high-speed logic (26,000 gates/cm², 30 µW/gate, 0.6-nsec delay). Both chips are 360 mils on a side with equivalent gate counts in excess of 20,000. The controller chip contains approximately 13,000 ISL logic gates plus on-chip ISL ROM and on-chip CML RAM. The PPP chip contains a number of parallel processing elements and associated memory consisting of 17,400 ISL logic gates and CML RAM. The PPP and controller chips utilize a total of 43 different macrocells. (A barrel shifter, an arithmetic logic unit, and a sequencer are among the most complex.) Random logic, RAMs, and ROMs are combined on a single chip for maximum on-chip data flow efficiency. The PPP and controller chips can be used in a variety of system architectures for a wide range of signal processing applications including acoustic, millimeter-wave, microwave, and optical applications. To achieve the VHSIC reliability requirement (0.006% per 1000 hours maximum failure rate), the chips have been designed with extensive self-test capabilities and the EOSP has spare PPP chips that are brought into play by self-test provisions at the subsystem level. In addition, on-chip redundancy is used to ensure the attainment of yield objectives, thus reducing manufacturing risk. Chip packaging, being developed jointly with the 3M Corporation, will be based on state-of-the-art ceramic chip carrier technology with beam type interconnections and proven high-pincount compatibility.


Hughes 

The CMOS/SOS chip technology of Hughes is being used to implement three reconfigurable
chips that provide key signal-processing functions common to a broad range of secure,
anti-jam communication systems. The high performance of these chips will enable them to
provide cost effective communication systems into the 1990s. Their flexibility,
reliability, and low power will enable them to be retrofitted into existing and developing
systems with significant cost savings and improvements in reliability. Detailed
technology insertion plans have been developed for the position location and reporting system
(PLRS), where substantial cost and reliability improvements can be achieved in near-term
production programs. In the Phase I brassboard demonstration, the three chips will
provide performance enhancement of the PLRS/JTIDS (Joint Tactical Information Distribution
System) hybrid for the future Army battlefield information distribution system. These
three chips have been defined and the first chip is being designed.


The three chips are a 256-stage digital correlator chip, an algebraic encoder/decoder
chip, and a spread spectrum subsystem chip. These chips will operate at 25-MHz off-chip
clock rates and 100-MHz on-chip clock rates. Their complexity level approaches 20,000
gates, and their full power at maximum clock rate is less than 0.7 W. The chips are
designed for nominal operation at 5 V, but they are compatible with 3 V operation as well.
Leaded ceramic flatpacks with up to 148 leads will house the chips.


The  PLRS  technology  insertion  plan  covers  the  digital  signal  and  message  processor 
subsystem  module  of  the  basic  user  unit,  a  man-packable  terminal.  The  primary  impact  of 
the  VHSIC  technology  insertion  will  be  to  halve  the  size,  weight,  and  cost  of  the  signal 
and  message  processing  module,  and  to  double  the  reliability  of  this  module.  The  total 
parts  count  for  the  two-card  module  will  be  reduced  from  280  to  about  100.  A  preliminary 
estimate  of  the  reliability  improvement  indicates  that  there  will  be  an  18%  increase  in 
the  predicted  mean  time  between  failures  (MTBF)  of  the  basic  user  unit  as  a  result  of 
changing  the  single  module.  The  corresponding  cost  reduction  is  12%  of  the  basic  user 
unit  cost,   resulting  in  very  substantial  production  cost  savings. 


IBM 

IBM is developing a complex multiply accumulate chip (CMAC). The CMAC is a
parameter-selectable signal processor which can perform 100 million multiply and accumulate
operations per second so that it can carry out the high-performance signal processing
algorithms required in the front-end data stream of many sensor processing systems. By using
the NMOS VHSIC technology, a very high multiply rate is achieved with significantly fewer
watts and in a much smaller volume than has been achieved with current technologies. This
high performance is accomplished with a very simple data flow that uses a number of
multipliers connected in a linear array. The simple-structure concept extends to the
control of the chip, where a parameter-selectable rather than programmable approach is
used. The algorithms executed are highly repetitive, simplifying the control task.
Flexibility is obtained by loading a parameter, during initialization, from an external
controller to configure the data flow for the desired algorithm. The CMAC chip has two
basic classes of operation: (1) complex multiplication of a number of parallel channels
by a set of preloaded weights and summation of the products, and (2) any of a number of
delay, multiplication, and summation operations on a single channel by using cascaded
sections of identical hardware. The CMAC operates with parallel stages of multiplication
and accumulation, each stage operating with a basic clock frequency of 25 MHz. Current
technology sizings indicate a chip density in excess of 30K gates for the chip. For
packaging, IBM is developing a single-chip ceramic carrier with polyimide/copper film
interconnects.
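
As a rough illustration of the first class of operation, the following Python sketch
multiplies a set of parallel complex channel samples by preloaded complex weights and
sums the products. The function name, the channel count, and the sample values are
hypothetical and chosen only for illustration; the sketch says nothing about the actual
CMAC data flow, word lengths, or pipelining.

```python
def weighted_channel_sum(samples, weights):
    """Multiply each channel sample by its preloaded complex weight and sum the products."""
    if len(samples) != len(weights):
        raise ValueError("one weight is required per channel")
    total = 0j
    for sample, weight in zip(samples, weights):
        total += sample * weight
    return total


# Hypothetical example: four parallel channels, weights loaded once at initialization.
weights = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
samples = [2 + 1j, 1 - 1j, 0 + 2j, 3 + 0j]
result = weighted_channel_sum(samples, weights)
```

In hardware the multiplications would proceed in parallel stages; the loop here is only
a sequential stand-in for that arithmetic.
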

The IBM brassboard is a high-performance acoustic preprocessor. The capability being
furnished will provide for increased target detection in anti-submarine warfare (ASW)
systems. Requirements in ASW have been evolving and growing for many years due to more
sophisticated and quieter threats. VHSIC technology is an ideal match for addressing the
acoustic problem in that the signal-processing capabilities needed for keeping pace with
the threat can be realized. The brassboard configuration specifically addresses an
upgrade to the input signal conditioner (ISC) of the AN/UYS-1. This "front-end"
improvement will provide for increased processing capability and will facilitate technology
insertion for those platforms currently employing the AN/UYS-1.


Texas  Instruments 

The Texas Instruments development is based on three elements: a small set of multipurpose
programmable system components implemented in commercially aligned semiconductor
technologies, a multimode fire and forget (M2F2) missile subsystem demonstration brassboard,
and a comprehensive set of software/hardware design tools to support subsystem design with
the basic chip set.

Eight chips have been defined and are currently under development: a high-performance
NMOS memory and seven logic-oriented components that will be implemented in Schottky
transistor logic (STL).

Memory needs are served with a single general-purpose component:

o     Static Random Access Memory (SRAM)

2~5 y sec read/write access, on-chip parity generate/check, and 1K block write
protection.

The 1750A instruction set architecture (ISA) was selected for data-processing
requirements, and three logic components were defined:

o     1750A Data Processor Unit (DPU)

A full 1750A ISA with 6-MIP 16-bit fixed-point throughput, multiprocessor support,
and memory error detection/correction.

o     Device Interface Unit (DIU)

Implements direct memory access (DMA) operations, provides interval timing, and
performs instruction I/O.

o     General Buffer Unit (GBU)

Supports multiple level bus operation, provides first-in/first-out (FIFO) buffered
transfers, and performs parallel I/O.

These three chips, together with the static RAM, can be configured as an application
specific or a generic data processing node.

Four logic chips have been defined to meet the array processing requirements. They
consist of a highly concurrent 16-bit fixed-point arithmetic resource and three support
chips:

o     Vector Arithmetic/Logic Unit (VALU)

75-MOP 16-bit fixed-point throughput pipelined arithmetic capability.

o     Vector Address Generator (VAG)

Two-dimensional X-Y array addressing, FFT bit reverse, cycle steal I/O support,
and data memory chip select decode.

o     Array Controller/Sequencer (ACS)

General purpose microcontroller with nested do-loop control, a subroutine stack,
and a register file.

o     Multipath Switch (MPS)

Connects memories with ports, supports cycle steal I/O, single memory/
multiprocessor broadcast, and static and dynamic state control.
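
The bit-reverse addressing listed for the VAG is the address ordering commonly needed by
radix-2 FFTs. The Python sketch below shows one way such a sequence can be generated in
software. It assumes a power-of-two array length, and the function name is hypothetical;
it is not a description of how the VAG itself produces addresses.

```python
def bit_reversed_addresses(length):
    """Return the addresses 0..length-1 in bit-reversed order (length must be a power of two)."""
    if length < 1 or length & (length - 1):
        raise ValueError("length must be a power of two")
    bits = length.bit_length() - 1
    addresses = []
    for index in range(length):
        reversed_index = 0
        for bit in range(bits):
            # Move bit number `bit` of the index to the mirrored bit position.
            if index & (1 << bit):
                reversed_index |= 1 << (bits - 1 - bit)
        addresses.append(reversed_index)
    return addresses


order = bit_reversed_addresses(8)
```

For an 8-point array the sequence is 0, 4, 2, 6, 1, 5, 3, 7.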

As with the data processor, these four chips and the SRAM can be configured to match
specific or generic array processing requirements. The entire chip set is TTL I/O
compatible and will be packaged in Joint Electronics Device Engineering Council (JEDEC) standard
leadless chip carriers.

The M2F2 VHSIC demonstration brassboard will be based on a passive rf, anti-radiation
homing (ARH) sensor, and an imaging infrared (I2R) sensor. The data and array processing
modules defined and developed for the brassboard will be integrated with two existing
sensors to provide the signal and control processing required for the operational function
of an M2F2 missile subsystem. The functions to be demonstrated include control of the ARH
and I2R sensor, multithreat processing of the ARH sensor data, precision cueing of the
I2R sensor to the radiating target selected, autonomous acquisition of the potential
target from the I2R field of view, and the concurrent track of the target by both sensors.


TRW 

The  chip  set  of  the  TRW,  Motorola,  and  Sperry-Univac  team  consists  of  eight  chips  made 
in  two  technologies.  The  four-port  memory  is  scheduled  for  fabrication  in  CMOS  and  the 
other  seven  in  triple-diffused  bipolar.  The  chips  operate  synchronously  from  a  25  MHz 
clock.  All  will  utilize  a  124-pin  ceramic  hermetic  chip  carrier  package  which  is  being 
specifically  designed  for  VHSIC  chips.  These  chips  are  designed  for  use  in  special 
purpose  preprocessors  and  general  purpose  signal  processors.  An  electronic  warfare 
brassboard  incorporating  the  chips  is  being  developed  by  TRW.     The  individual  chips  are: 


o     Matrix Switch

The matrix switch chip consists of several 4-bit output ports, each of which can
be connected to any of several input ports, selectable by control lines.

o     Microcontroller

The microcontroller provides a microinstruction address generation function;
special features are included that facilitate pipeline control required for VHSIC
signal processing applications.

o     Address Generator

An address generator chip generates sequences of addresses under program control
without realtime dedication of the CPU. Once loaded with sets of origination,
displacement, and iteration data, single CPU commands will generate large
sequences of addresses.

o     Register Arithmetic Logic Unit

The RALU implements 16-bit arithmetic and Boolean functions. Input register
banks provide fast on-chip data. The RALU may be cascaded to implement a 32-bit
RALU.

o     Multiplier-Accumulator

The 16-bit MAC performs combinations of multiplication, addition, and
subtraction, including complex multiplication using two devices. The inputs and outputs
are equipped with enable control to facilitate the use of multiple devices on a
bus.

o     Content Addressable Memory

The CAM makes it possible to scan large numbers of memory cells for a match with
a specific value. The CAM chips can be connected in cascade for large
comparisons.

o     Window Addressable Memory

The WAM makes it possible to scan many memory words with a given set of upper and
lower limits to determine values within limits.

o     Four-Port Memory

The FPM has two read ports and two write ports which are accessible independently
and simultaneously. Two writes and two reads are available every clock cycle.
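
The difference between the CAM and WAM searches can be sketched in a few lines of
Python. The function names and the stored values are hypothetical, and the window limits
are assumed to be inclusive, which the chip description does not state. The hardware
compares many cells at once; the list scan below is only a sequential model of the result.

```python
def cam_match(cells, value):
    """Return the indices of all cells that equal the search value."""
    return [index for index, cell in enumerate(cells) if cell == value]


def wam_match(words, lower, upper):
    """Return the indices of all words inside the window (limits assumed inclusive)."""
    return [index for index, word in enumerate(words) if lower <= word <= upper]


# Hypothetical memory contents.
stored = [12, 47, 30, 47, 5, 63]
exact_hits = cam_match(stored, 47)
window_hits = wam_match(stored, 10, 40)
```

Here the exact-match search returns cells 1 and 3, while the window search returns
cells 0 and 2.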


Westinghouse 


The Westinghouse team effort is focused on the development of a multiprocessing
computer capable of handling the signal-processing functions of the advanced tactical fighter
environment. Included in the team are National Semiconductor, Control Data Corporation,
Harris, and Carnegie Mellon Institute. The team has adopted a hierarchical design
approach which includes a chip set based on 1.25-µm CMOS IC technology, a set of modular
signal processors, an architecture capable of adaptation to a large variety of
environments, and a high-order language programming approach.

A  total  of  36  minicells,  the  basis  of  the  hierarchical  design  approach,  have  been 
designed  and  simulated.     These  minicells  are  used  to  implement  a  family  of  six  chips. 

o     Controller

A general-purpose controller for executing the 1750A, AYK-14, vector/scalar, and
vector processor standard instruction set architectures.

o     Pipeline Arithmetic Unit (AU)

Organized to perform complex arithmetic at 40-MHz pipeline rate, the pipeline AU
chip is fully programmable to perform signal processing functions.

o     16-Bit AU

The general-purpose 25-MHz arithmetic unit for computing and image processing
applications.

o     Extended AU

A 32-bit arithmetic chip for floating-point arithmetic in computers and signal
processors.

o     64K Memory

A high-speed memory chip (20 nsec) for signal processing and computing.

o     VLSI Gate Arrays

5K gate arrays to provide the required interface and logic circuitry to complete
the system design.


The six chips are combined in various ways to form five signal processing modules. A
common arithmetic vector processor is optimized for FFT, digital filter, and complex
arithmetic operations; a floating point processor (FPP) is optimized to perform a mix of vector
and scalar instructions; a vector/scalar processor is a smaller version of the FPP
operating in slower firmware; a hierarchical multiprocessor system controller executes data flow
in the processors; and a bulk memory module provides data storage for the processors. The
chip package development is addressing leaded chip carriers with large pin counts of
120-220 pins and leads on 20, 16.67, and 12.5 mil centers.


Conclusions 


These chips and brassboards that will be fabricated in the VHSIC Program are extendable
to a variety of applications. Table 2 lists potential applications of the funded
brassboards in terms of specific systems. In the proposals for Phase I, more than 30 other
brassboards  using  the  28  basic  chips  and  a  few  others  were  suggested  but  are  not  yet 
funded.  Each  of  these  would  also  have  multiple  potential  system  applications.  Efforts 
are  now  underway  to  increase  this  technology  transition/insertion  effort  so  as  to  obtain 
rapid  application  of  the  VHSIC  technology  in  a  large  number  of  defense  systems.  However, 
the  path  from  the  chip  to  the  operational  system  is  still  tortuous  and  will  require 
dedicated  management  attention  to  obtain  the  results  we  seek. 

Table  2.     Brassboard  system  impact  examples 


Contractor  Potential  System  Applications 
 Brassboard    Army    Navy   Air  Force  

Texas  Instruments  Hellfire,   Tank  Breaker                                   Locust,   I2R  Seeker 

Multimode  Fire  and  Forget  I2R  Seeker,  ARP 
Missile 


Hughes  PLRS/JTIDS  A-J  Modem,  HF  Como,  JTIDS 

Battlefield  Information  JTIDS 
Distribution  System 

TRW  EW  Targeting  System,  ESM  New  Threat 

EW  Signal  Processor  Quicklook  II,  Processor  Warning  System, 

Advanced  Guardrail  APMS 


IBM  P-3C,  SUBACS, 

Acoustic  Signal  Processor  DDG-X 

Honeywell  AAH,   RPV,  M-l,   LH-X  IRST,   EO  for  LANTIRN, 

Electro-optical  Signal  Hellfire,  AIFS  F-18  and  F-14     Pave  Tack 

Processor  I2R  Seeker  SEAFIRE  I2R  Seeker 


Westinghouse  DIVAD,  M-l,  ASPJ  Advanced  Tactical 

Advanced  Tactical  Fighter  Quiet  Radar  Fighter,  B-l, 

Radar  Processor  Improved  AWACS 

Relative performance of very high speed integrated circuit (VHSIC)
chip sets for selected signal processing functions


James  D.  Marr 

Electrical  Engineering  Department,  University  of  Alabama  in  Huntsville,  Huntsville,  Alabama  35899 


Abstract 

The  Very  High  Speed  Integrated  Circuit   (VHSIC)  program  was  started  in  1978  to  produce 
high  performance  signal  processing  chips  for  military  systems.     Contracts  have  been  let  for 
Phase  I  for  six  chip  sets.     Modules  using  these  chips  are  also  being  developed  under  this 
phase.     Six  of  the  funded  modules  and  two  hypothetical  modules  are  examined  for  nine  signal 
processing  tasks. 

Introduction

The Department of Defense initiated the VHSIC Program in 1978 to stimulate private
industry to produce the technology needed for military applications in the 1980's and beyond.
With the growth in demand for commercial integrated circuits, the DOD fraction of the IC
market has declined to less than 10% of production. The reduced market has resulted in less
emphasis on military needs, such as built-in testing, fault tolerance, and military
temperature specifications. The VHSIC program is aimed at production of the needed technology and
devices to meet DOD needs.

The VHSIC program is divided into four phases. In Phase 0, starting in January of
1980, nine companies (or teams of companies) made preliminary investigations and proposed
programs for the next phase. These proposals were for chip set development and for
brassboards demonstrating the applicability of the chips to specific military systems. In May
1981, contracts were awarded on six of the nine proposals. Each of the six Phase I
contractors will produce a chip or set of chips and will demonstrate a brassboard for one
application. The chips will have 1.25 micron or smaller features, and are expected to
provide a 40-fold increase in processing throughput over present ICs. The chips will be
available in limited quantities in 1983-84 and brassboards will be demonstrated in early
1984. Phase II, starting 1984, involves reducing feature size to 0.5 micron and increasing
throughput. Phase III runs concurrently with Phase I and Phase II, and provides
small-contract technical support in many areas. Over 40 of these contracts have been assigned.

The  VHSIC  program  is  pushing  forward  the  limitations  of  current  technology.     As  in 
all  such  programs,  there  is  a  degree  of  risk  in  meeting  schedules. 

The information in this paper is based on reports of varying age; it is primarily
derived from the VHSIC Specifications Handbook preliminary edition dated January 1982, and
from earlier vendor documents, but some information was obtained orally. As of January
1982, some chip designs were not yet frozen. Neither the author nor the contract sponsor
accepts liability for the accuracy or timeliness of any information contained here.


VHSIC  Requirements 

During  the  remainder  of  this  century,  the  DOD  will  be  fielding  several  high  technology 
systems.     This  sophisticated  hardware  will  act  as  a  force  multiplier  in  a  conflict  with  a 
more  numerous  but  less  sophisticated  threat.     Implementing  the  signal  processing  hardware 
for  these  new  systems  in  custom  LSI  would  be  very  costly,  would  take  about  10  years  based 
on  the  current  speed  of  technology  insertion,  and  would  present  a  severe  logistical  problem 
in stocking parts. The need for a better solution is a major motivation for the VHSIC
program.

Rapid  technology  insertion  is  not  the  only  reason  for  VHSIC.     Chips  being  developed 
today  generally  do  not  include  fault  tolerance  and  built-in  test;  VHSIC  does.     In  addition, 
the  VHSIC  program  is  needed  to  encourage  development  of  MIL-spec  qualified  chips.  The 
VHSIC  chips  are  also  strongly  oriented  toward  high  speed  signal  processing  functions. 

It is not enough to have high-throughput chips; the chips must be connected to form
some useful device. Therefore, some two dozen applications were specified for the
competitors to demonstrate their chip sets. The vendors typically proposed a half-dozen
applications out of the list.

The six systems selected, the principal vendors, and the chips used in Phase I are
as follows:

Honeywell
Electro-Optical Signal Processor
    Parallel Programmable Pipeline (PPP)
    controller

Hughes
Battlefield Information Distribution System (BIDS)
    Correlator
    Algebraic Encoder/Decoder
    Spread Spectrum Subsystem

IBM
Acoustic Signal Processor
    Complex Multiplier Accumulator (CMAC)

Texas Instruments
Multimode Fire and Forget Missile
    Static Random Access Memory (SRAM)
    1750A data processor
    Vector Arithmetic Logic Unit (VALU)
    Vector Address Generator (VAG)
    Array Controller/Sequencer (ACS)
    Multipath Switch (MPS)
    2 generic gate arrays (programmable system component - PSC)

TRW, Motorola, Sperry Univac
EW Signal Processor
    Content Addressable Memory (CAM)
    Window Addressable Memory (WAM)
    4-Port Random Access Memory (4PRAM)
    Register Arithmetic Logic Unit (RALU)
    Multiply Accumulator (MAC)
    controller
    Address Generator (AG)
    Matrix Switch (MS)

Westinghouse, National Semiconductor, Control Data Corporation, Harris,
Carnegie-Mellon Institute
Advanced Tactical Fighter Radar Processor
    Static Random Access Memory (RAM)
    Arithmetic Unit (AU)
    Extended Arithmetic Unit (EAU, EXAU)
    Pipeline Arithmetic Unit (PLAU)
    controller
    VLSI gate array (not VHSIC)

Chip  capabilities 

The  gate-Hz  product  and  operations  per  second  (ops)  specification  are  not  useful  for 
selection  of  a  particular  chip  or  module  for  a  specific  application.     For  example,  a  memory 
rated  at  100  MOPS  (million  ops  per  second)  would  be  somewhat  ineffective  for  performing 
arithmetic  operations.     Some  other  measure  of  capability  is  needed. 

Evaluating classes of chips separately and considering the ability of individual chips
within the class to perform appropriate tasks seems to be a better approach. For example,
cycle time and size are appropriate measures for a memory, but multiply rate is more
appropriate for an arithmetic unit. The chips may be divided into two general classes:
arithmetic chips and non-arithmetic chips.

Non-arithmetic  chips 

The  non-arithmetic  chips  may  be  further  subdivided  among  memory,  control,   "glue",  and 
other  chips.     Three  memory  chips  will  be  available.     If  speed  is  the  overriding  factor,  then 
the 40 MHz Westinghouse chip or perhaps the TI chip without error correcting code would be
best  suited.     If  throughput  is  most  important,  then  the  TRW  4-port  memory  may  be  a  better 
choice.     For  error  correcting,  the  TI  chip  is  the  best  choice. 

The four controllers are probably best used with the chips which they are intended to
control. That is, the choice of a controller chip should probably be made after the choice
of an arithmetic chip.

The "glue" chips are intended to connect together the other components to form a system.
Simple data paths generally require no support chips, but more complex structures and
connection to other modules do require specialized chips. Choices for this support range from
custom VHSIC to custom LSI/VLSI to general purpose MSI; the decision among these choices
must be based on application requirements. Two companies, TI and TRW, are offering a
multiple-connection switching array, but with slight differences. The TI (tentative) chip will use
one-fourth as much power, but may not be suited to many applications. Westinghouse and TI
also offer generic gate arrays.

The  three  specialized  Hughes  chips  are  best  suited  to  their  intended  tasks.  Similarly, 
the  TRW  addressable  memory  chips  are  appropriate  to  their  tasks.     The  remaining  chips  could 
probably be used with products of other companies, given the proper support. For example,
IBM will produce only an arithmetic chip, which requires memory and an address sequencer to
perform almost any function other than data filtering.

Arithmetic  chips 

The  arithmetic  chips  could  be  divided  trivially  between  the  one  microprocessor  and  the 
other  eight  units.     This  distinction  does  not  justify  a  separate  listing,  but  should  be  kept 
in  mind  when  comparing  the  TI  1750A  performance  to  the  others. 

Only Westinghouse definitely has floating point, but TI might add it to the 1750A if
microcode space allows. The three Westinghouse arithmetic chips are approximately
equivalent except in speed. For speed, the 40 MHz PLAU is fastest, followed by the AU and chips
of three other vendors at 25 MHz. The EAU has an effective performance of 12.5 MHz, but it
is a floating point and 32 bit unit. The speed rating of the Honeywell PPP is too
application-dependent to estimate, but is between 5 MHz and 200 MHz for arithmetic operations.
The TI 1750A and the TRW RALU are slowest.

Once again, these performance results are for data available when needed at all input
ports. If the module design offers less than this perfect support, then the chip
capabilities may be degraded. Details on chip performance are beyond the scope of this paper.

Analysis  of  module  capabilities 

This  section  analyzes  the  capabilities  of  the  modules.     Only  the  arithmetic  modules 
are  examined  in  depth  here;   the  Westinghouse  bulk  memory  and  bus  controller  modules  are  not 
amenable  to  arithmetic  performance  measures.     These  two  non-arithmetic  units  strongly 
influence  system  capability.     The  16  module  capacity  of  the  Westinghouse  ring  bus  controller 
limits the size of a ring; if more than 16 modules are needed to perform a function, then a
second  ring  and  bus  interface  modules  must  be  employed.     Analysis  of  this  more  complex 
structure  is  greatly  beyond  the  scope  of  this  paper. 

Task  definitions 

Six  tasks  in  nine  variations  are  considered  here  as  measures  of  module  performance. 
The  tasks  are  selected  to  resemble  common  signal  processing  operations.     Since  most  of  the 
applications of interest involve signal processing rather than general computation,
test-and-branch or search-for-value tasks are not examined in this paper. In all tasks, the
possibility  of  overflow  is  ignored.     Pipelined  operation  is  allowed  if  supported  by  the 
hardware.     Modules  having  parallel  I/O  capability  are  evaluated  for  both  compute  speed  and 
I/O  speed. 

The  first  task  is  a  complex  FFT  1024  long  on  16  bit  data.     Radix  2  or  radix  4  may  be 
chosen  as  appropriate.     The  butterflies  are  computed  using  16  bit  arithmetic  throughout. 
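
The amount of work in this task can be estimated by counting butterflies. The Python
sketch below uses the standard counts for power-of-two lengths, (N/2) log2 N radix-2
butterflies or (N/4) log4 N radix-4 butterflies, and multiplies by a per-butterfly time.
The function names are hypothetical, and the estimate assumes that butterfly arithmetic
dominates, with no addressing or I/O overhead.

```python
def radix2_butterflies(n):
    """Number of radix-2 butterflies in an n-point FFT: (n / 2) * log2(n)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stages = n.bit_length() - 1
    return (n // 2) * stages


def radix4_butterflies(n):
    """Number of radix-4 butterflies in an n-point FFT: (n / 4) * log4(n)."""
    if n < 4 or n & (n - 1) or (n.bit_length() - 1) % 2:
        raise ValueError("n must be a power of four")
    stages = (n.bit_length() - 1) // 2
    return (n // 4) * stages


def fft_time_us(butterflies, ns_per_butterfly):
    """Total time in microseconds if each butterfly takes ns_per_butterfly nanoseconds."""
    return butterflies * ns_per_butterfly / 1000.0


radix2_count = radix2_butterflies(1024)
radix4_count = radix4_butterflies(1024)
estimate_us = fft_time_us(radix2_count, 2960)
```

A 1024-point transform needs 5120 radix-2 butterflies or 1280 radix-4 butterflies. With
the 2960 ns per 16 bit butterfly quoted later for the TI 1750A module, the radix-2 count
gives about 15155 us, which agrees with the FFT time reported for that module.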

The  second  task  is  a  sum  of  products  256  long.     This  corresponds  to  a  dot  product  or 
a  correlation  or  an  IIR  filter.     One  of  the  two  vectors  must  be  input,  but  the  other  may  be 
stored  in  memory.     There  are  two  variants  of  this  task,  with  8  or  16  bit  products;  the  sum 
is  always  16  bits. 

The third task is a city-block distance or sum-of-absolute-difference measure. One of
the  1024-long  vectors  must  be  input,  but  the  other  may  already  be  in  memory. 

The  fourth  task  is  a  vector-matrix  product.     A  64-long  vector  is  multiplied  by  a  64- 
square  coefficient  matrix  to  give  a  new  64-long  vector;  this  is  a  series  of  sum-of-product 
operations  on  the  same  input  data.     The  task  corresponds  to  various  transformations  such  as 
the  real  DFT.     Coefficients  may  be  in  memory  already  if  sufficient  locations  are  available. 
This task has variations for 8 and 16 bit products; the sum is always 16 bit.

The fifth task is a complex vector-matrix product. It differs from the fourth task
only in the use of complex data and coefficients. The coefficients are used in the same
two variants.

The  final  task  is  to  compute  one  32  bit  complex  radix  2  butterfly.     The  32  bit 
arithmetic is excessive for many applications, but does provide a fair test of double
precision  speed.     In  pipeline  structures,  the  pipeline  rate  is  reported  rather  than  the 
time  needed  to  compute  a  single  butterfly  in  isolation. 

For  all  cases,  it  is  assumed  that  data  may  reach  the  module  at  the  appropriate  times; 
addressing  and  other  overhead  may  result  in  slower  performance. 
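
For reference, the second, third, and fourth tasks can be written out as plain
arithmetic. The Python sketch below is only a functional definition under the same
simplification as the text (overflow ignored); it does not model 8 or 16 bit word
lengths, pipelining, or I/O, and the function names and the small test vectors are
hypothetical.

```python
def sum_of_products(a, b):
    """Second task: dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def city_block_distance(a, b):
    """Third task: sum of absolute differences of two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def vector_matrix_product(vector, matrix):
    """Fourth task: one sum of products per matrix row, all on the same input vector."""
    return [sum_of_products(row, vector) for row in matrix]


# Small hypothetical data; the paper's tasks use lengths of 256, 1024, and 64.
a = [1, 2, 3]
b = [4, 5, 6]
dot = sum_of_products(a, b)
distance = city_block_distance(a, b)
transformed = vector_matrix_product(b, [[1, 0, 0], [0, 1, 0], [1, 1, 1]])
```

At the sizes defined above, the second task needs 256 multiplications and the fourth
task needs 64 x 64 = 4096. The fifth task has the same structure with complex operands.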

Performance 

Table I summarizes the speed performance of the modules on the tasks and gives an
overall speed rating. This overall value is the geometric mean of the ratios of module speed to
best speed on each of the nine tasks. If no ratio is possible, a value of 100 is used. The
size  column  indicates  the  approximate  module  size  in  units  of  15  watts  and  15  cubic  inches. 
The  modules  are  listed  in  three  groups:   general  purpose  modules,  hypothetical  modules,  and 
high  performance  modules.     Within  each  group,  modules  are  listed  alphabetically  by  vendor. 
More  detailed  descriptions  of  speed  derivations  follow. 
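
The overall rating can be sketched as follows in Python. The sketch assumes that each
ratio is the module's time on a task divided by the best time on that task, so that
larger values mean slower modules and the substitute value of 100 acts as a penalty;
this reading is an assumption, since the direction of the ratio is not spelled out.
The function name and the sample times are hypothetical and are not taken from Table I.

```python
import math

MISSING_RATIO = 100.0  # value used when a module cannot perform a task


def overall_rating(module_times, best_times):
    """Geometric mean of module_time / best_time; None marks a task the module cannot do."""
    ratios = []
    for module_time, best_time in zip(module_times, best_times):
        if module_time is None:
            ratios.append(MISSING_RATIO)
        else:
            ratios.append(module_time / best_time)
    # Average the logarithms, then exponentiate, to form the geometric mean.
    log_sum = sum(math.log(ratio) for ratio in ratios)
    return math.exp(log_sum / len(ratios))


# Hypothetical three-task example: ratios of 2, 4, and the substitute value 100.
rating = overall_rating([2.0, 8.0, None], [1.0, 2.0, 5.0])
```

A module that is best on every task would receive a rating of 1 under this reading.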

Honeywell. The Honeywell PPP module is not of comparable size to the other modules.
To  allow  a  more  fair  comparison  of  computing  power  for  a  given  volume  and  heat  dissipation, 
a  simple  module  was  hypothesized.     The  module  consists  of  enough  PPP  chips  to  provide  32 
processors  in  parallel,  a  bulk  memory,  and  a  crossbar  switch  for  alternate  data  paths.  The 
clock  is  33  MHz. 

The  FFT  requires  about  36000  cycles  or  1  ms  for  one  simple  processing  algorithm,  but 
there  appears  to  be  a  more  complex  valid  algorithm  requiring  only  530  us.     The  32  bit 
butterfly requires about 4200 ns. The architecture was not totally defined in the documents
used  in  generating  this  paper.     The  sum  of  products  requires  77  or  138  us  for  the  8  and  16 
bit  products,   respectively.     The  sum  of  absolute  differences  is  I/O  limited  to  one  processor 
and  so  uses  93  us.     The  vector-matrix  product  task  allows  parallel  computation  on  the  same 
input  using  local  coefficients.     The  real  cases  require  35  or  65  us;  the  complex  cases  use 
74  or  626  us. 

IBM. The IBM brassboard is much larger than the other modules, so a smaller module was
hypothesized.     Early  reports  indicated  that  the  chip  will  not  support  absolute  value  or  32 
bit  arithmetic,  so  these  tasks  are  not  included.     The  hypothetical  module  behaves  primarily 
as  a  filter,   and  is  frequently  limited  by  data  paths  into  and  out  of  the  module. 

Texas  Instruments.     The  two  TI  modules  are  designed  for  differing  purposes.     The  1750A 
is  a  general  purpose  computer  with  interrupt  handling  and  other  features,  while  the  VALU 
module  is  a  fast  arithmetic  unit. 

The  1750A  module,  although  fast  by  current  standards,  is  slower  than  most  other  VHSIC 
modules.     A  16  bit  FFT  butterfly  requires  about  2960  ns,   and  the  1024  FFT  requires  15155  us. 
A 32 bit butterfly would require about 11920 ns. The sum of products tasks would require
194 and 236 us. The sum of absolute differences needs 614 us. With only one processor,
there  is  little  exploitable  parallelism  for  the  vector-matrix  products.     However,  it  is 
possible  to  simultaneously  accumulate  8  complex  sums  or  16  real  sums  in  the  20  registers. 
(The  1750A  standard  requires  20  general  purpose  registers.)     The  16  real  sums  would  require 
540  and  704  us  for  the  8  and  16  bit  products  respectively.     This  execution  time  may  be 
compared to the 655 and 819 us needed using the brute force approach, giving a speed improvement
of 14 to 18%. This 16 sum operation must be repeated four times to process all 64
answers. Similarly, the complex vector-matrix products would require 8.6 and 11.3 ms, which
is about a 30% improvement over separate computation.

The  VALU  module  is  much  faster  than  the  1750A  module.     Three  memory  banks  support  the 
VALU chip. A 16 bit butterfly requires 120 ns, so the 1024 FFT takes 614 us. A 32 bit
butterfly requires about 640 ns. The sum of products requires 40 ns per term using multiple
computing  sections.     The  sum  of  absolute  differences  also  requires  40  ns  per  term.  Total 
times  for  these  tasks  are  10.24  and  40.96  us  respectively.     The  vector-matrix  products  may 
be  computed  in  164  us  for  real  and  655  us  for  complex  operations. 
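
The FFT and per-term times quoted for the two TI modules follow from simple operation counts. The sketch below is a hedged check, not part of the original analysis: the helper names are invented here, and the term counts (256 for the sum of products, 1024 for the sum of absolute differences, 64 x 64 for the real vector-matrix product) are inferred from the quoted totals rather than stated in this section.

```python
import math


def fft_time_us(points, butterfly_ns):
    """Time for a radix 2 FFT done as (N/2) * log2(N) sequential butterflies."""
    stages = int(math.log2(points))
    butterflies = (points // 2) * stages
    return butterflies * butterfly_ns / 1000.0


def per_term_time_us(terms, ns_per_term):
    """Time for a task that consumes one term every ns_per_term nanoseconds."""
    return terms * ns_per_term / 1000.0


print(fft_time_us(1024, 2960))         # 1750A: 5120 butterflies at 2960 ns
print(fft_time_us(1024, 120))          # VALU: 5120 butterflies at 120 ns
print(per_term_time_us(256, 40))       # VALU sum of products
print(per_term_time_us(1024, 40))      # VALU sum of absolute differences
print(per_term_time_us(64 * 64, 40))   # VALU real vector-matrix product
```

A 1024 point radix 2 FFT has 10 stages of 512 butterflies, so the butterfly time alone fixes the FFT time whenever the butterflies are issued back to back.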

TRW. The original TRW MSP has about the same processing power as the new modules for
the  tasks  considered  here.     An  additional  RALU  or  a  WAM  chip  will  not  greatly  alter  the 
processing  capability  for  the  nine  arithmetic  tasks;  therefore,  only  the  original  MSP  is 
considered  here.     Assume  that  the  setup  time  is  only  needed  if  the  switch  is  reconfigured. 
That  is,  assume  that  instantaneous  changes  are  possible  within  the  MAC  and  RALU  chips.  If 
this assumption is not valid, then the times reported for most tasks should be doubled.

MODULE        FFT   SP8  SP16    SD   VA8  VA16    C8    C16   BFY  SIZE     TOTALS
TI 1750A     1520   194   236   614  2161  2816  8643  11264  1192     1   69.7  .01
TRW MSP        82    10    10    41   164   164   655    655    96     2    4.3  .23
W VSP          82    10    10    41   164   164   655    655    64     1    4.1  .24
Ho PPP*        53    77   138    93    35    65    74    626   420     1    5.2  .19
IBM CMAC*      51     5     5    --   102   102   204    204    --     1    3.0  .34
TI VALU        61    10    10    41   164   164   655    655    64     1    4.0  .25
W FPP          41     6     6    26    82    82   328    328    16     2    2.0  .50
W CAVP         13     6     6    26    35    35   102    102    10     2    1.1  .92
Units:       10us    us    us    us    us    us    us     us  10ns

Legend: Total is the geometric mean of the ratio of unit speeds to the best speeds, and its
reciprocal; size is in units of about 15 watts and 15 cubic inches. * hypothetical.


Table I. Module computation times.

The 16 bit 1024 FFT can be performed in a pipelined fashion in 819.2 us. A 32 bit
butterfly requires about 960 ns. The sum of products can be computed completely in the MAC
in  10  us.     If  one  of  the  terms  is  also  stored  in  the  ROM,  then  it  is  possible  to  compute 
part  of  the  sum  in  the  RALU;  however,  this  would  save  only  about  5%  on  computation  time  and 
would  greatly  increase  programming  difficulty.     The  sum  of  absolute  difference  task  is 
performed  in  both  units;   the  difference  may  be  computed  in  the  MAC  and  the  absolute  sum 
computed in the RALU at (perhaps) 41 us total. It is assumed, without evidence either way,
that absolute value is one of the functions implemented in the RALU. The vector-matrix
products  may  be  computed  by  the  same  algorithm  as  the  sum  of  products,  and  requires  164  us. 
The  complex  case  requires  655  us. 

Westinghouse. Only three Westinghouse arithmetic units are considered in detail here.
The  BMEM,  BC,  and  BI  are  not  arithmetic  units  and  hence  may  not  be  appropriately  tested. 
The  1750A  GP  was  not  documented  in  sufficient  detail  for  analysis. 

The  arithmetic  units  are  limited  by  bus  I/O  speed,  by  internal  connections  among  chips, 
and  by  the  speed  of  the  arithmetic  chips.     It  is  assumed  here  that  data  enters  through  a 
single port and that there is no memory-I/O conflict in accessing the data. As long as a
task takes at least one cycle per data value, there is no I/O limitation on module speed.

The  Vector  Signal  Processor  (VSP)   is  the  slowest  of  the  three.     Based  on  reports,  it 
requires 819 us for the 1024 FFT. A 32 bit butterfly takes about 640 ns. The sum of products
and sum of absolute differences require one cycle per term for 10 and 41 us respectively.
Based  on  internal  structure  and  specifications,  the  vector-matrix  product  requires  164  us 
total.     Similarly,  the  complex  product  can  take  no  less  than  655  us. 

The  Floating  Point  Processor  (FPP)  is  intended  for  32  bit  fixed  and  floating  point 
arithmetic,  but  may  perform  16  bit  fixed  point  operations.     It  has  two  AUs  which  are  assumed 
fully  functional.     This  assumption  is  based  on  information  provided  by  Westinghouse  at  a 
briefing.     The  FPP  has  about  twice  the  speed  of  the  VSP  for  16  bit  real  operations.  For 
floating  point  operations,  the  ratio  is  slightly  lower.     The  32  bit  butterfly  is  estimated 
at 160 ns. Since the I/O limit still holds, the sum of products requires 6.4 us and the
sum of absolute differences takes 25.6 us.

The  Complex  Arithmetic  Vector  Processor  (CAVP)  module   is  a  40  MHz  processor  using  two 
PLAU  and  one  AU.     It  is  assumed  that  the  AU  runs  at  20  MHz  instead  of  at  25  MHz  asynchronous 
to  the  PLAUs.     Data  may  come  from  the  memory  by  three  paths  and  from  the  I/O  buffer  by  one 
path;  assume  that  the  structure  of  memory  and  buffer  will  allow  three  and  one  accesses  per 
cycle  respectively. 

Based  on  vendor  specifications,  the  1024  FFT  will  require  128  us.     A  32  bit  butterfly 
requires about 100 ns. The sum of products could be performed on several multipliers simultaneously,
but the I/O limit is still 256 cycles or 6.4 us. The sum of absolute differences
is  more  challenging.     The  difference  and  the  absolute  value  may  be  taken  in  one  PLAU,  and 
this  result  passed  through  memory  to  accumulate  in  the  second  PLAU,  giving  a  pipelined  rate 
of  one  term  per  cycle  or  25.6  us  for  the  task. 

The  vector-matrix  product  may  be  accumulated  in  several  simultaneous  paths  requiring 
35  us.     The  complex  vector-matrix  product  requires  one  cycle  per  term  or  102  us. 

Results

As  expected,  the  1750A  is  the  slowest  module,  with  less  than  one-fiftieth  the  speed  of 
the fastest standard module, the CAVP. The 1750A is more than an order of magnitude slower
than  the  other  two  general  modules.     The  MSP  and  VSP  have  almost  identical  performance, 
differing  only  on  double  precision  arithmetic. 

The faster hypothetical module, the IBM CMAC, is rated comparable to the general purpose
modules in speed; it is as fast as the high performance modules on most tasks, but
cannot  perform  double  precision  arithmetic  or  an  absolute  value  function.     The  Honeywell 
module  is  the  fastest  module  for  some  tasks,  up  to  30%  faster  than  the  CAVP;  however,  much 
of this speed gain is due to coefficients being stored in element memory. It must be remembered
that these two modules are hypothetical, and would only exist if customers were
willing  to  fund  development. 

The  VALU  is  listed  with  the  high  performance  modules  because  it  is  specialized.  While 
it  is  much  faster  than  the  other  TI  module,   it  is  only  slightly  faster  than  the  VSP  and  MSP, 
and  then  only  for  the  FFT.     The  CAVP  is  the  fastest  module  in  the  group,  about  twice  as  fast 
as  the  FPP  and  four  times  as  fast  as  the  VALU. 

The Westinghouse ring bus is a significant factor to consider. The bus allows great
flexibility,  graceful  degradation,  and  real-time  reconfiguration.     On  the  other  hand,  the 
bus  structure  has  several  disadvantages  over  modules  interconnected  in  a  more  dedicated 
manner.     These  disadvantages  include  structural  complexity  in  larger  systems  (multiple 
rings),  a  larger  minimum  system  (a  BC  is  mandatory),  more  complex  data  input  modules  to 
allow  packets,  and  packet  transmission  delays.     Examination  of  one  specific  application 
indicates that about 25% more space and 15% more time is required for use of the bus; on
the  other  hand,  the  system  will  still  run  if  one  arithmetic  unit  becomes  unusable.  It 
might  be  possible  to  adapt  Westinghouse  modules  to  a  different  bus  structure  by  changing  the 
module  interface  chips. 

Cost  is  also  a  factor.     None  of  the  vendors  have  yet  determined  the  cost  of  new  items, 
but  most  seem  to  accept  some  approximations   (within  a  factor  of  five)  as  reasonable.  Chips 
will  cost  about  $250  each.     Boards  may  cost  about  $500.     Design  costs  are  $100K  for  a  CGA  or 
$500K  for  a  new  module  using  existing  chips.     New  chips  cost  about  $25  per  gate  to  design. 
Reprogramming  an  existing  module  would  only  cost  about  $100K.     Thus  it  is  frequently  more 
effective  to  reprogram  an  existing  module  than  to  design  a  new  one;   it  is  also  generally 
faster.


The  approximate  cost  per  module  is  as  follows: 


MODULES   CHIPS  BOARDS   DESIGN    PARTS
PPPa         19       1   $ 500K   $ 5250
PPPb         26       1     500K     7000
PPPb'        26       1     500K     7000
CMAC         23       1     500K     6250
1750A        13       1       --     3750
VALU         15       1       --     4250
MSP          33       2       --     9250
VSP          21       1       --     5750
CAVP         44       2       --    12000

Assumed  costs  are  $250  per  chip,   $500  per  board,  and  $500,000  for  design  of  a  new  module. 
A  programming  cost  of  $100,000  per  board  applies  only  to  reuse  of  an  existing  board.  These 
estimates  should  not  be  in  error  by  more  than  a  factor  of  five. 
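
The parts column follows from the assumed unit costs, and the same figures give a rough way to compare a new design against reuse of an existing module over a production run. The sketch below is only an illustration under the stated cost assumptions: the function names are invented here, and charging the reprogramming cost once per board is one reading of the text.

```python
CHIP_COST = 250            # dollars per chip (assumed in the text)
BOARD_COST = 500           # dollars per board (assumed in the text)
NEW_DESIGN_COST = 500_000  # one-time cost of designing a new module
REPROGRAM_COST = 100_000   # one-time cost per board of reprogramming


def parts_cost(chips, boards):
    """Recurring parts cost of one module."""
    return chips * CHIP_COST + boards * BOARD_COST


def program_cost(chips, boards, units, new_design):
    """One-time engineering cost plus parts for a run of `units` modules."""
    one_time = NEW_DESIGN_COST if new_design else REPROGRAM_COST * boards
    return one_time + units * parts_cost(chips, boards)


# Hypothetical PPPa (new design) against an existing VSP (reprogrammed).
for units in (5, 1000):
    print(units, program_cost(19, 1, units, True), program_cost(21, 1, units, False))
```

For a handful of units the one-time cost dominates, which is the arithmetic behind preferring existing modules for small production runs.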

It  should  be  emphasized  again  that  only  certain  arithmetic  operations  are  tested  in  the 
benchmark  functions  used  here.  Branching,  searching,  and  general  mixed  instructions  are  not 
considered. The speeds given are design maximum values; overhead may result in slower speeds.
Results  may  be  different  for  other  test  measures. 

Conclusions 

The  choice  of  modules  for  a  particular  system  depends  on  several  factors.     Among  these 
factors  are  algorithm  complexity,  required  speed,  physical  space  available,  and  number  of 
systems  to  be  produced.     If  many  units  will  be  produced,  then  design  of  a  new  module  may  be 
justified;   if  five  units  are  planned,  then  existing  modules  should  probably  be  chosen. 

For  small  expendable  systems  such  as  intelligent  bombs,  a  specialized  system  may  be 
a  better  choice  than  the  more  adaptive  ring  bus.     Similarly,  if  small  physical  size  is  a 
limiting  factor,  the  Westinghouse  bus  should  probably  not  be  used.     The  TI  1750A  is  the 
smallest independent control module at 8 watts and 8 cubic inches; the TI VALU is the
smallest  powerful  arithmetic  unit.     The  Honeywell  modules  are  small  and  powerful,  but  would 
have  high  design  costs. 

For floating point arithmetic, Westinghouse is the only choice. The selection of a
particular  module  will  depend  on  the  computational  throughput  needed. 

For  a  general  purpose  computer,  the  1750A  is  probably  the  best  choice.     For  a  general 
purpose  module,  the  MSP  and  VSP  are  better. 

Beyond these generalizations, little can be said. Module selection is too system dependent
to make many absolute statements.


Abstract


A  highly  concurrent  architecture  for  a  high-speed  digital  signal  processing  engine  is 
described.  Initially,  a  repetitive  structure  of  identical  processing  elements  which 
realizes  a  one-dimensional  state  variable  filter  is  defined.  This  structure  uses  primarily 
local  communication  with  topologically  simple  interconnections  among  the  processing 
elements.  It  is  then  shown  that  these  relatively  simple  processing  engines  can  be 
combined,  again  in  a  regular  manner,  to  solve  two-dimensional  filter  problems.  Application 
of  this  basic  filter  structure  to  provide  high-speed  solutions  for  other  standard  signal 
processing transforms such as the Discrete Fourier Transform, the Chirp Z Transform, and
the Running Discrete Fourier Transform is also presented.


I.     Introduction


Let u(.) denote the input sequence to a digital filter and let y(.) denote the
resulting output sequence. Filter operation for many applications proceeds as follows:

(i)     accept u

(ii)    $y = y_p + D u$

(iii)   transmit y

(iv)    $x^{new} = A x^{old} + B u$

(v)     $y_p = C x^{new}$

(vi)    $x^{old} = x^{new}$

(vii)   Return to (i) to await next input.

The vectors $x^{old}$ and $x^{new}$ are usually of dimension n so that step (iv) requires
matrix  multiply  operations.  It  is  clear  that  the  maximum  rate  at  which  the  filter  can 
operate  is  determined  primarily  by  the  speed  at  which  steps  (ii)  -  (v)  can  be  performed  and 
that  high  performance  structures  must  implement  some  form  of  pipeline  processing  to 
minimize  computation  time  for  the  matrix  multiply/add  operations.  Moreover,  a  general 
filter  architecture  must  have  the  capability  of  realizing  filter  functions  for  a  range  of 
possible  filter  dimensions  and  it  must  accept  externally  supplied  coefficient  matrices,  A, 
B,  C,  and  D.  In  Kalman  Filtering  operations,  the  coefficient  matrices  can  also  depend  upon 
the  independent  variable(s)  such  as  time  or  space  and  thus  the  coefficient  matrices  must  be 
continuously  variable  from  external  data. 
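
The loop of steps (i) through (vii) can be sketched in a few lines of Python for a single-input, single-output filter. This is a minimal illustration, not the proposed hardware: the class name, the zero initial state, and the use of plain lists for the matrices are assumptions made here. The point it shows is that the partial output `yp` is prepared in advance, so the output needs only one multiply-add once the input arrives.

```python
class StateVariableFilter:
    """Single-input, single-output state variable filter with precomputed yp."""

    def __init__(self, A, B, C, D):
        self.A, self.B, self.C, self.D = A, B, C, D
        self.x_old = [0.0] * len(B)   # zero initial state (an assumption)
        self.yp = 0.0                 # C x for the zero initial state

    def step(self, u):
        # (ii) the output needs only one multiply-add after u is accepted
        y = self.yp + self.D * u
        # (iv) next state: the O(n^2) matrix multiply that limits the rate
        x_new = [sum(a * x for a, x in zip(row, self.x_old)) + b * u
                 for row, b in zip(self.A, self.B)]
        # (v) partial output for the next sample
        self.yp = sum(c * x for c, x in zip(self.C, x_new))
        # (vi) the new state becomes the old state
        self.x_old = x_new
        # (iii) in hardware y is transmitted before step (iv) begins
        return y


f = StateVariableFilter(A=[[0.5]], B=[1.0], C=[1.0], D=0.0)
print([f.step(u) for u in (1.0, 0.0, 0.0, 0.0)])
```

With these first-order coefficients the printed sequence is the impulse response of the filter, delayed by one sample because D is zero.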


The  purpose  of  this  paper  is  to  describe  a  filter  architecture  that  achieves  the  above 
objectives  subject  to  the  standard  VLSI  constraints  of  highly  regular  structures  with 
predominantly  local  connections.  In  Section  II,  the  proposed  architecture  is  described  and 
its  functional  characteristics  are  interpreted  for  the  standard  one-dimensional  filtering 
application.  A  high-performance  structure  for  two-dimensional  filtering  based  upon  a 
hierarchical  imbedding  of  the  basic  signal  processing  architecture  is  described  in  Section 
III.  Finally  the  use  of  the  proposed  architecture  for  other  applications  such  as  the  DFT 
and  CZT  is  considered  in  Section  IV. 

II.     Filter  Architecture 


There  are  many  different  minimal  realizations  of  the  form 


    X(k+1) = A X(k) + b u(k)
    y(k)   = C X(k) + d u(k)

for a given filter transfer function, G(z) = Y(z)/U(z), where U and Y are scalars. If one
proceeds with a direct implementation of the above equations, the matrix product indicated
as A X(k) requires $O(n^2)$ multiply-add operations, where n is the filter order. This
implementation on a single von Neumann processor incurs a minimum delay between sample
points that is also of $O(n^2)$, thereby greatly constraining the filter bandwidth that can
be achieved. This problem is usually avoided in practice by choosing a canonical structure
such as the phase variable form, the Jordan form or its derivatives for implementation on a
von Neumann processor. The use of canonical structures reduces minimum delay between
sample points to $O(n)$ at the cost of a highly constrained filter structure.
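
The difference in operation count can be seen by comparing a general state update with a phase variable update. The sketch below is illustrative and rests on assumptions made here: the phase variable form is taken with ones on the superdiagonal, a last row of negated coefficients, and an input vector b = [0, ..., 0, 1]; the function names are hypothetical.

```python
def dense_update(A, b, x, u):
    """General update A x + b u: n*n multiply-adds for the matrix product."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi * u
            for row, bi in zip(A, b)]


def phase_variable_update(a, x, u):
    """Phase variable update: n multiply-adds, the rest is a shift.

    x[i] <- x[i+1] for i < n-1, and
    x[n-1] <- u - (a[0]*x[0] + ... + a[n-1]*x[n-1]).
    """
    last = u - sum(ai * xi for ai, xi in zip(a, x))
    return x[1:] + [last]


def companion_matrix(a):
    """Dense matrix equivalent to the phase variable update above."""
    n = len(a)
    rows = [[1.0 if j == i + 1 else 0.0 for j in range(n)] for i in range(n - 1)]
    rows.append([-ai for ai in a])
    return rows


a = [0.1, 0.2, 0.3]
x = [1.0, 2.0, 3.0]
print(dense_update(companion_matrix(a), [0.0, 0.0, 1.0], x, 4.0))
print(phase_variable_update(a, x, 4.0))
```

Both calls produce the same next state, but the second uses three multiplies where the first uses nine; the price is that A is restricted to the companion pattern.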

It  is  possible  to  provide  a  fast,  general  filter  by  using  multiple  processing  elements 
fabricated  with  VLSI  technology.  The  key  to  building  array  processing  machines  in  VLSI  is 
the  ability  to  formulate  a  repetitive  structure  of  processing  elements  in  which  all 
elements  are  identical,  communication  is  local,  and  interconnection  among  the  processing 
elements  is  topologically  simple.  Large  integrated  circuits  require  prohibitive  design, 
layout,  and  test  times  unless  regular,  repetitive  structures  are  used.  If  a  single 
processing  element  can  be  designed  and  used  repeatedly  in  a  processing  array  structure, 
design  of  large  scale  processing  machines  becomes  tractable. 

Local  communication  is  an  important  requirement  for  large  scale  array  processing 
machines  for  two  primary  reasons.  The  first  involves  the  area  required  for  non-local 
communication  paths.  For  example,  many  microprocessor  chips  use  a  substantial  part  of 
their  layout  area  for  communication  busses.  As  a  greater  number  of  processing  elements 
with  system  wide  communication  are  added,  the  communication  related  area  dominates  all 
other  area  requirements.  A  second  potential  limitation  of  non-local  communication  derives 
from  the  delay  in  broadcasting  data  over  a  signal  line  with  resistance  and  capacitance. 
Signal  and  clock  skew  are  already  one  of  the  limiting  factors  in  the  design  of 
supercomputers  [1].  As  a  given  technology  is  scaled,  allowing  larger  processing  arrays  on 
an  integrated  circuit  die,  the  resistance  of  conductors  increases  while  the  signal  lines 
for  global  communication  become  relatively  longer.  Ultimately,  this  provides  a  limit  on 
the  processing  power  of  a  globally  interconnected  structure. 

Topologically  simple  interconnections  are  a  necessity  for  large  integrated  circuit 
array  processors.  Typical  integrated  circuit  technologies  have  two  or  three  conductive 
layers  which  may  be  used  for  interconnections.  Only  one  or  two  of  these  layers  provide  low 
resistance  paths  for  minimum  delay  signalling.  Layout  of  interconnections  between  circuits 
requires  a  substantial  percentage  of  the  design  time  for  integrated  circuits.  Automatic 
routing  programs  and  methodologies  are  only  partially  successful  in  minimizing  this  time 
[2].  Interconnection  problems  for  arrays  of  processors  can  be  minimized  through  a  design 
methodology  we  call  "interconnection  by  default".  Each  processing  element  is  designed  so 
that  all  input  and  output  signal  paths  match  when  identical  modules  are  placed  adjacently. 
All  interconnect  efforts  are  part  of  the  local  design  of  the  processing  element.  Array 
interconnections  occur  by  default  when  the  processing  elements  are  placed  in  their  array 
structure.

The  primary  incentive  for  building  large  scale  processing  arrays  is  to  trade 
processing  elements  for  calculation  time.  If  a  signal  processing  algorithm  requires 
$O(n^2)$ time in a single processor system, it can normally be reconfigured to require only
$O(n)$ time if n processing elements are properly oriented to solve the problem. Figure 1
shows  a  connection  of  processing  elements  to  compute  an  iteration  of  a  state  variable 
filter  in  O(n)  time.  All  the  processing  elements  are  identical,  communication  is  primarily 
local,  and  the  elements  are  designed  for  interconnection  by  default.  The  processor  array 
precalculates  the  next  state  such  that  the  input  u  must  precede  the  output  y  by  only  one 
calculation  time. 

Coefficients  of  the  state-transition  matrix  are  held  in  circular  buffers  contained  in 
each  processing  element.  Simple,  serial  loading  suffices  for  time-invariant  coefficients. 
Alternately,  the  basic  structure  can  allow  time-varying  coefficients  to  be  loaded  in 
parallel  from  the  top  of  the  circular  buffers.  The  input  must  be  available  at  the  left 
side  of  the  array  while  the  output  is  generated  at  the  right  side. 

The  processor  structure  of  Figure  1  is  general  purpose  and  can  be  used  for  other 
algorithms.  For  example,  with  processing  elements  capable  of  complex  arithmetic,  a  DFT  can 
be solved in $O(n)$ time. The output array can be made available in parallel at the bottom
of  each  of  the  processing  elements.  Or,  external  connections  can  be  minimized  by  serially 
shifting  the  output  array  through  the  leftmost  processing  element. 
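
The DFT case can be sketched as a software model of the array. This is a hedged illustration rather than the authors' design: it assumes one processing element per output bin, each holding the powers of the DFT kernel in its circular buffer, with the input samples broadcast one per step. The function name `array_dft` is introduced here.

```python
import cmath
from collections import deque


def array_dft(samples):
    """Model of an n-element array computing an n point DFT in n steps."""
    n = len(samples)
    # Element k holds W^(k*t), t = 0..n-1, in its circular buffer.
    buffers = [deque(cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
               for k in range(n)]
    acc = [0j] * n                    # one accumulator register per element
    for sample in samples:            # one broadcast per step: n steps in all
        for k in range(n):            # in hardware these run simultaneously
            acc[k] += buffers[k][0] * sample
            buffers[k].rotate(-1)     # advance the circular buffer
    return acc


print([round(abs(v), 6) for v in array_dft([1, 1, 1, 1])])
```

The inner loop stands in for n multiply-adds that the array performs at once, so the elapsed time grows with the number of samples rather than with its square.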

Each  processing  element  is  a  simple  combination  of  storage  cells  and  arithmetic 
circuitry.  Figure  2  shows  the  internal  structure  of  a  processing  element.  A  circular 
shift  buffer  of  length  n  +  1  is  used  to  hold  the  coefficients  required  in  the  filter 
calculations.  The  circular  buffers  of  all  processing  elements  are  shifted  in  unison  to 
provide  the  coefficients  in  the  required  order  for  the  arithmetic  unit.  The  arithmetic 
unit  consists  of  a  multiplier  followed  by  an  adder  to  perform  the  generic  ax  +  b 
calculation  required  in  many  algorithms.  A  feedback  storage  register  (P)  is  provided  to 
delay and hold the output of the current calculation for input to the adder for the next
calculation. A shift register is also necessary for the state-variable filter. When the
next state calculation is complete, this new state (X) is loaded into the shift register.
As calculation of a next state proceeds, the present state is shifted to the left to be
made available for subsequent calculations.

Figure 1. General block diagram of process element array.

In  order  to  illustrate  the  filter  operation,  a  discussion  of  steps  required  for  one 
transition  of  a  third-order  filter  is  given  below.  The  discussion  is  based  on  Figure  3. 
The  time  is  kT  +  t,  input  u(kT)  has  just  been  used  to  complete  the  calculation  of  the  next 
system  state  X(kT  +  T),  and  calculation  of  the  output  y(kT)  has  just  been  completed.  Note 
that  the  letter  t  is  used  to  indicate  the  time  for  a  single  parallel  calculation  by  the 
elemental  processors.  For  a  third-order  system,  a  time  of  4t  is  required  to  perform  all 
calculations  for  an  iteration  of  the  filter.  With  this  structure,  the  output  y(kT)  is 
available  after  only  one  parallel  calculation  (time  t)  by  the  elemental  processors. 

The  state  of  the  filter  at  time  kT  +  t  is  available  in  the  shift  register  shown  in  the 
lower part of Figure 3. The state variable contained in the leftmost register [X(1)] is
bussed  to  all  elemental  processors.  Each  elemental  processor  has  a  set  of  coefficients 
stored  in  a  circular  shift  register.  These  coefficients  are  shifted  once  for  each 
calculation.  During  each  interval,  the  bussed  state  variable  is  multiplied  by  the  next 
available  coefficient  from  each  shift  register  and  the  product  is  added  to  the  accumulator 
register  P.     The  accumulator  registers  were  cleared  prior  to  this  step. 

At  time  kT  +  2t,  the  state  variables  have  been  shifted  to  the  left  so  that  another 
state  variable  is  available  on  the  bus.  Also,  each  circular  shift  register  has  been 
shifted  by  one  position  so  that  a  new  coefficient  is  ready  for  input  to  the  multiplier. 
The  state  variable  and  corresponding  coefficients  are  multiplied,  added  to  the  contents  of 
accumulator  register  P,   and  then  used  to  update  the  contents  of  the  accumulator  registers. 

At  time  kT  +3t,  calculations  identical  to  those  of  the  previous  step  are  performed 
with  the  shifted  state  variable  and  the  shifted  coefficients.  The  result  is  again  added  to 
the  accumulator  registers.  It  is  important  to  recognize  that  each  processor  is  operating 
simultaneously  and  performing  identical  operations  in  each  of  these  steps. 

The  next  input  u(kT  +  T)  can  be  accepted  at  the  input  after  the  previous  calculation. 
An  external  multiplexer  connects  u(kT  +  T)  to  the  state  variable  bus  instead  of  connecting 
the  shift  register  to  this  bus  as  was  the  case  in  previous  steps.  The  coefficients  which 
are  available  at  this  step  include  the  B  vector  and  the  d  coefficient.  The  completion  of 
this  step  provides  the  output  y(kT  +  T)  and  finishes  the  computation  of  the  next  state 
X(kT + 2T). The process is repetitive from this point.
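
The step-by-step operation described above can be modelled in software. The following Python sketch is an assumption-laden illustration, not a description of the actual circuit: it uses n elements for the next state plus one for the output, gives each a circular buffer holding n state coefficients followed by one input coefficient, and broadcasts one bus value per step. The name `array_iteration` is hypothetical.

```python
from collections import deque


def array_iteration(A, b, c, d, state, u):
    """One filter iteration on n + 1 processing elements in n + 1 steps.

    Elements 0..n-1 accumulate the next state A x + b u; element n
    accumulates the output c x + d u for the present state x.
    """
    n = len(state)
    # Circular buffers: n state coefficients, then the input coefficient.
    buffers = [deque(list(row) + [bj]) for row, bj in zip(A, b)]
    buffers.append(deque(list(c) + [d]))
    acc = [0.0] * (n + 1)             # accumulator registers P, cleared
    shift_register = deque(state)
    for tick in range(n + 1):
        # The multiplexer places the input on the bus for the final step.
        bus = shift_register[0] if tick < n else u
        for p, buf in enumerate(buffers):   # simultaneous in hardware
            acc[p] += buf[0] * bus
            buf.rotate(-1)                  # shift the coefficient buffer
        if tick < n:
            shift_register.rotate(-1)       # present state shifts left
    return acc[:n], acc[n]


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(array_iteration(A, [1, 0, 2], [1, 1, 1], 0.5, [1, 0, -1], 2))
```

For the third-order example the loop runs four times, matching the 4t per iteration quoted in the text, and every buffer has rotated back to its starting position when the iteration ends.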

The  processor  organization  of  Figure  3  demonstrates  a  repetitive  structure  which  is 
highly desirable for VLSI fabrication. The 'tick' marks delineate cell boundaries. Cell
interconnections  are  simple  consisting  of  the  state  variable  bus,  the  lower  shift  register 
interconnection,  and  clock  and  control  signals  (not  shown).  Control  is  simple  since  the 
processing  at  each  internal  step  is  identical.  The  structure  is  easily  expandable  to 
higher-order  systems  by  simple  addition  of  identical  multiplier  cells  along  one  axis  and 
addition  of  identical  stages  of  the  coefficient  shift  register  along  the  other  axis. 

The  architecture  of  the  processing  element  array  is  general  purpose  in  that  a 
structure  of  order  n+1  can  be  used  to  solve  problems  of  order  m  in  only  m  +  1  steps  where 
$m \le n$. This generality is accomplished by short circuiting the circular shift path for the
coefficients.  Each  of  the  circular  shift  buffers  contains  a  multiplexing  switch  along  with 
a  single  bit  indicator  (Figure  2).  The  indicator  bits  are  chained  together  as  a  shift 
register.  When  the  coefficients  are  loaded,  a  logic  ONE  is  established  at  the  last  active 
shift  buffer.  As  calculations  are  performed,  coefficients  are  shifted  up  to  this  point  and 
then  start  back  down  to  be  made  ready  for  the  next  operation. 


An  nth  order  processing  element  array  can  be  used  to  solve  a  smaller  mth  order  problem 
as  follows.  Operation  proceeds  as  described  for  the  three  by  three  example,  except  that 
results  from  the  m  leftmost  registers  only  are  used  in  calculating  the  next  state  while  the 
single  rightmost  processing  element  continues  to  perform  output  calculations.  A  control 
signal  is  provided  each  m  +  1  cycles  to  transfer  the  newly  calculated  next  state  into  the 
state  variable  registers  at  the  bottom  of  each  processing  element. 

III.     Two-Dimensional Filtering

A standard form for two-dimensional linear digital filters is [3,4]

$$\begin{bmatrix} x^h(i+1,j) \\ x^v(i,j+1) \end{bmatrix} = \begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix} \begin{bmatrix} x^h(i,j) \\ x^v(i,j) \end{bmatrix} + \begin{bmatrix} B_1 \\ B_2 \end{bmatrix} u(i,j) \qquad (1)$$

$$y(i,j) = \begin{bmatrix} C_1 & C_2 \end{bmatrix} \begin{bmatrix} x^h(i,j) \\ x^v(i,j) \end{bmatrix} \qquad (2)$$

where both u(i,j) and y(i,j) are taken to be scalars. Suppose that $x^h(0,j)$, j = 0, 1,
2, ... and $x^v(i,0)$, i = 0, 1, 2, ... are given and that u(i,j), $i \ge 0$, $j \ge 0$, is
available. Propagation of the state update calculation in (1) occurs in a wave-like manner
as shown by Figure 4.

Figure 4. Calculation order for two-dimensional processing.


Mathematically,  the  states  (i,j)  are  calculated  in  a  sequence  determined  by  an  integer 
k  where 

a)  $i \ge 0$, $j \ge 0$

b)  i + j = k,  k = 1, 2, ...

At  a  particular  value  of  k,  all  state  variables  can  be  simultaneously  computed  based  on 
values  of  x(i,j)   and  u(i,j)   at  k-1. 
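
The calculation order can be enumerated explicitly for a finite grid. The sketch below is illustrative: the generator name `wavefronts` and the finite bounds M and N on i and j are assumptions made here, and the k = 0 entry is included only to show the given starting point.

```python
def wavefronts(M, N):
    """Yield (k, points) where points are all (i, j) with i + j = k.

    The grid is 0 <= i <= M, 0 <= j <= N. Every point on one wavefront
    depends only on points of the previous wavefront, so the points of a
    wavefront can be computed simultaneously.
    """
    for k in range(M + N + 1):
        yield k, [(i, k - i) for i in range(max(0, k - N), min(M, k) + 1)]


for k, points in wavefronts(2, 2):
    print(k, points)
```

Each printed row is one parallel step; the rows grow toward the main anti-diagonal of the grid and then shrink again.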

In  many  applications,  only  one  of  the  independent  variables  will  have  infinite  range 
corresponding  perhaps  to  time.  Suppose  that  i  ranges  from  zero  to  infinity  and  that  j 
ranges  from  zero  to  N.  Then  a  computational  architecture  whose  fundamental  processing 
units  are  the  basic  linear  state  variable  filters  of  Section  II  can  be  easily  devised  so 
that the parallel wavelike structure of Figure 4 is implemented. In general, N+1 linear
filters having primarily local interconnections are required to implement the parallel
computations. The idea is illustrated in the sequence of Figures 5a-5f for N=2. Each
block is a state variable filter.


Note that processor input sequences in Figures 5a-5f are formed columnwise from the
input data sequence by juxtaposition of data N+1 columns apart. For example, the input data
sequence for processor I is [u(0,0), u(0,1), u(0,2), u(3,0), u(3,1), u(3,2), u(6,0), ...].
The input data sequences are shifted by one unit of basic filter cycle time for each of
filters II, III, etc. The interconnection structure is local between adjoining processors
except for the feedback shown between processors N+1 and 1. (Perhaps a ring geometry could
be used to avoid this problem.) Another operational characteristic of this architecture is
the fact that after N computations, each processor zeros the computed value of the vertical
state variable just determined. This is to provide for the restart of the processor for a
new input data column.
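
The input sequencing can be sketched as follows. The paper gives the sequence only for processor I (columns 0, 3, 6, ... when N = 2); the sketch assumes that processor p takes columns p-1, p-1+(N+1), ... and that sample u(i,j) is consumed at filter cycle i + j, which reproduces the one-cycle shift between neighbouring processors. The function name `processor_schedule` is hypothetical.

```python
def processor_schedule(p, N, num_columns):
    """Return (cycle, i, j) triples for processor p (1-based) of N + 1 processors.

    Assumptions (not stated explicitly in the paper): processor p handles
    columns i = p-1, p-1+(N+1), ..., and consumes u(i, j) at cycle i + j.
    """
    return [(i + j, i, j)
            for i in range(p - 1, num_columns, N + 1)
            for j in range(N + 1)]


# Processor I for N = 2 and seven input columns.
schedule_I = processor_schedule(1, 2, 7)
inputs_I = [(i, j) for _, i, j in schedule_I]
```

Under these assumptions each processor is busy on every cycle once started, and the horizontal state needed by processor I for column 3 is produced by processor III one cycle earlier, which corresponds to the feedback path between processors N+1 and 1.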

If i ranges from 0 to M, and j ranges from 0 to N, then the structure of Figure 5 can
provide the output computation in time proportional to N+M+1 filter operations. In
contrast, a single von Neumann processor would require execution times proportional to
(N+1)(M+1) filter operations.

Figure 5.     Computational sequence for 2-D filter with N=2.

IV.     Transform Applications


In  the  preceding  two  sections,  applications  of  the  linear  filtering  structure  to 
time/spatial  domain  filtering  problems  were  described.  In  this  section  it  will  be 
established  that  the  proposed  structure  is  also  useful  in  frequency  domain  applications. 
Recall that the Discrete Fourier Transform for a sequence x(k), k=0, 1, ..., N-1 is defined
to be

$$
X(n) = \sum_{k=0}^{N-1} x(k)\, W_N^{nk}, \qquad n = 0, 1, \ldots, N-1 \tag{3}
$$

where

$$
W_N = e^{-j2\pi/N}. \tag{4}
$$

Kung and Leiserson in [6] point out that if one defines $X = [X(0), X(1), \ldots, X(N-1)]^T$
and $x = [x(0), x(1), \ldots, x(N-1)]^T$, then the DFT can be viewed as a simple matrix
computation requiring $N^2$ multiply/add operations. Specifically

$$
X = Fx \tag{5}
$$

where the elements of F are complex and are given by $f_{nk} = W_N^{nk}$. It is shown in [6] that
by use of parallel processing one can obtain the DFT in time proportional to N, rather
than N log2(N) as when an FFT structure is used. By externally altering the structure of
feedback  in  Figure  1,  the  proposed  architecture  can  be  used  to  determine  the  DFT  in  a 
processing  time  proportional  to  N.  The  reason  that  this  can  be  done  is  that  the 
architecture  of  Figure  1  basically  implements  a  matrix-vector  multiplication  operation  and 
(5)  indicates  that  this  is  precisely  what  is  required  by  the  DFT.  The  principal 
differences are that a feedback structure is not required and that the output consists of a
vector quantity X rather than the scalar y in the one-dimensional filtering application.
Thus, in order to compute the DFT the state component feedback paths of Figure 1 must be
opened by use of externally controlled switches. This will allow each elemental processor
to accumulate one component of the DFT sequence. The kth circular buffer is loaded with
the sequence of coefficients $W_N^{kr}$, r=0, 1, ..., N-1 in order to effect the required DFT
calculation. The sequence x(k), k=0, 1, ..., N-1, is then applied to the signal
processor and the partial sums accumulated in the registers of each elemental multiply/add
processor. When x(N-1) has been processed, X(n), n=0, 1, ..., N-1 is available
simultaneously in these registers. If, when the feedback path is disconnected, these
registers are connected to output pins, then X will be available in time of O(N). This
option may be tenable for small N but the number of pins required could become unmanageable
for large N. If sequential output of X(n) is acceptable, then the DFT can be obtained in
time proportional to at worst 2N by simply closing the feedback paths after N calculations
and sequentially reading the buffers where DFT components are stored from the left-hand
port of Figure 1.
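
The accumulation scheme can be mimicked in software. The following is a minimal sketch, not the circuit of Figure 1: it assumes one accumulator and one coefficient buffer per DFT component, with the inner loop standing in for the N elemental processors that would operate concurrently. The name `dft_by_accumulation` is illustrative.

```python
import cmath


def dft_by_accumulation(x):
    """Compute the DFT by streaming samples past N multiply/add accumulators."""
    N = len(x)
    # Buffer n holds the coefficients W_N^(n*r), r = 0..N-1, with W_N = exp(-j*2*pi/N).
    buffers = [[cmath.exp(-2j * cmath.pi * n * r / N) for r in range(N)]
               for n in range(N)]
    acc = [0j] * N
    for r, sample in enumerate(x):      # one input sample per cycle
        for n in range(N):              # performed in parallel in hardware
            acc[n] += buffers[n][r] * sample
    return acc


X = dft_by_accumulation([1.0, 2.0, 3.0, 4.0])
```

After the last sample has been applied, every accumulator holds one component X(n), so the outer loop length N reflects the O(N) processing time discussed above.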

Although  the  above  development  has  emphasized  the  computation  of  the  Discrete  Fourier 
Transform,  one  could  also  compute  other  transforms  by  use  of  the  same  structure.  For 
example, the Chirp Z Transform (CZT) is defined by

$$
X(z_k) = \sum_{n=0}^{N-1} x(n)\, A^{-n}\, W^{nk}, \qquad k = 0, 1, \ldots, M-1 \tag{6}
$$

where $W = W_0 e^{-j\phi_0}$, $A = A_0 e^{j\theta_0}$, and $W_0$ and $A_0$ are positive and real. Once again,
the CZT is clearly representable as a matrix multiplication operation and the
conclusions of the DFT discussion therefore apply to it as well.
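
The matrix view of the CZT can be made concrete with a small sketch. It assumes only the summation in (6); the function name `czt` and the test values are hypothetical, and A and W are passed as arbitrary nonzero complex numbers rather than in polar form.

```python
import cmath


def czt(x, M, A, W):
    """Evaluate X(z_k) = sum_n x(n) A^(-n) W^(n k), k = 0..M-1, as a matrix-vector product."""
    N = len(x)
    # M-by-N coefficient matrix; row k corresponds to output point z_k.
    coeff = [[A ** (-n) * W ** (n * k) for n in range(N)] for k in range(M)]
    return [sum(coeff[k][n] * x[n] for n in range(N)) for k in range(M)]


# With A = 1 and W = exp(-j*2*pi/N), M = N, the CZT reduces to the DFT.
N = 4
X = czt([1.0, 2.0, 3.0, 4.0], N, complex(1.0), cmath.exp(-2j * cmath.pi / N))
```

Because only the coefficient matrix differs from the DFT case, the same accumulate-per-output arrangement applies, with M accumulators instead of N.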

Another application of the basic architecture is that of computing Running Discrete
Transforms. As a simple example, suppose that with the occurrence of each new data sample,
it is desired to generate the DFT of the set of points composed of the previous N-1 sample
points plus the most recent sample point. Then

$$
X(p,n) = \sum_{k=0}^{N-1} x(p-k)\, W_N^{nk}, \qquad n = 0, 1, \ldots, N-1 \tag{7}
$$

It is straightforward to show that

$$
X(p+1,n) = W_N^{n}\, X(p,n) + [x(p+1) - x(p-N+1)], \qquad n = 0, 1, \ldots, N-1 \tag{8}
$$

Let

$$
X(p) = [X(p,0), X(p,1), \ldots, X(p,N-1)]^T \tag{9}
$$

The update matrix equation for the running DFT is therefore

$$
X(p+1) = T\, X(p) + P\,(x(p+1) - x(p-N+1)) \tag{10}
$$

where P is an N x 1 vector each of whose entries is a one and

$$
T = \mathrm{Diag}[W_N^{n},\ n = 0, 1, \ldots, N-1]. \tag{11}
$$

Equation (10) is a recursive equation similar in structure to the basic filter of Section
II. In addition to the requirement of complex arithmetic and the need to form the term
(x(p+1) - x(p-N+1)) externally prior to application to the filter, a scheme for extracting
X(p) from the filter similar to that described for the DFT is also required.
Generalizations  to  other  types  of  running  transforms  such  as  the  Haar  and  Hadamard 
Transforms  are  immediate  [5]. 
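
The running DFT recursion can be checked numerically. The sketch below assumes the window of (7), whose oldest sample is x(p-N+1), and compares one recursive update per new sample against direct evaluation of (7). The function names and the test sequence are hypothetical.

```python
import cmath


def window_dft(x, p, N):
    """Direct evaluation of (7): X(p, n) = sum_{k=0}^{N-1} x(p-k) W_N^(n k)."""
    return [sum(x[p - k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]


def running_dft_update(X, newest, oldest):
    """One recursive step: X(p+1, n) = W_N^n X(p, n) + (newest - oldest)."""
    N = len(X)
    return [cmath.exp(-2j * cmath.pi * n / N) * X[n] + (newest - oldest)
            for n in range(N)]


x = [0.5, -1.0, 2.0, 3.5, 1.0, -2.5, 0.0, 4.0, 1.5, -0.5]
N = 4
p = N - 1
X = window_dft(x, p, N)
max_error = 0.0
while p + 1 < len(x):
    # newest sample enters the window, x(p-N+1) leaves it
    X = running_dft_update(X, x[p + 1], x[p - N + 1])
    p += 1
    reference = window_dft(x, p, N)
    max_error = max(max_error, max(abs(a - b) for a, b in zip(X, reference)))
```

Each update costs one complex multiply and one add per component, which is the diagonal structure of T in the matrix form of the recursion.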


Discussion


The goal of this paper has been to describe a first step toward a highly concurrent
and programmable signal processing architecture that can be efficiently implemented in VLSI
technology. An architecture was described that allows one to implement a pipeline
realization for general purpose single variable digital filters. It was then shown that
this architecture can be used hierarchically to implement two-dimensional filters and that
if the elementary processor is provided with complex arithmetic capabilities, then via
simple reconfiguration, the architecture can be used to compute various transforms.

In this paper, we have not attempted to address the important technological questions
regarding the number of elementary processors that can be placed on a single chip.
Subcircuits which are applicable to this structure have been fabricated in NMOS in a
multiproject chip through a Texas A&M University/Texas Instruments cooperative venture.
Based on results from this work, we speculate that one-micron technology will allow the
realization of several processors per chip.


References


[1]  R. D. Levine, "Supercomputers," Scientific American, vol. 246, pp. 118-135, Jan. 1982.

[2]  J. Werner, "Software for Gate-Array Design: Who is Really Aiding Whom?," VLSI
Design, vol. II, 4th quarter, 1981.

[3]  Roesser, Robert R., "A Discrete State Space Model for Linear Image Processing," IEEE
Trans. Aut. Control, Vol. AC-20, No. 1, pp. 1-10, Feb. 1975.

[4]  Kung, S.Y., B.C. Levy, M. Morf, T. Kailath, "New Results in 2D Systems Theory,
Part II: 2-D State Space Models - Realization and the Notions of Controllability,
Observability and Minimality," Proc. IEEE, Vol. 65, No. 6, pp. 945-961, June 1977.

[5]  Stuller, John A., "Generalized Running Discrete Transforms," IEEE Trans. Acoustics,
Speech and Signal Processing, Vol. ASSP-30, No. 1, pp. 60-68, Feb. 1982.

[6]  Kung, H.T. and C.L. Leiserson, Chapter 8 of Introduction to VLSI Systems, by C.A.
Mead and L.A. Conway, Addison-Wesley, 1980.


SPIE  Vol.  341  Real  Time  Signal  Processing  V(1982)  /  259 


Signal  processor  architecture  performance  assessment 
for  very  high  speed  integrated  circuits  (VHSIC) 


R.  W.  Priester 

Systems  &  Measurements  Division,  Research  Triangle  Institute 
P.O.  Box  12194,  Research  Triangle  Park,  North  Carolina  27709 

Abstract 

This  paper  discusses  the  problem  of  digital  signal  processor  architecture  performance 
evaluation.     This  aspect  of  signal  processor  technology,  while  not  totally  ignored  in  the 
past,   has  not  received  the  explicit  consideration  which  it  deserves.     If  effective  use  of 
available  resources  is  to  be  achieved,   future  implementation  of  complex  VHSIC/VLSI-based 
systems  probably  will  require  increased  consideration  of  the  architecture  performance 
assessment  problem.     Three  broad  approaches  to  this  problem  are  discussed  and  a  brief 
example  of  each  is  presented.     Of  the  approaches  considered,   it  appears  at  present  that 
techniques  based  upon  computer-aided  simulation  and/or  analyses  of  signal  processor  models 
represent     the  most  promising  approach  to  this  problem.     Quantitative  figures  of  merit  for 
use  in  evaluating/comparing  signal  processor  performance  are  presented  and  briefly 
discussed.     These  are  typically  defined  in  terms  of  a  limited  number  of  selected  system 
parameters  which  might  be  of  concern  in  a  given  application.     Several  techniques  that  are  of 
interest  to  the  architecture  performance  assessment  problem  are  briefly  reviewed  and 
discussed.

1.0  Introduction

Writing  in  1975,  Fuller   [1]   directed  the  following  remarks  toward  the  problem  of 
performance  analysis  of  general-purpose  computer  systems:      "Many  current  computer  systems 
are  as  complex  as  such  other  artificial  systems  as  high-performance  aircraft  or  modern 
skyscrapers.     The  discouraging  fact  is  not  the  complexity  of  computer  systems  but  that 
computer  engineers  do  not  have  the  range  of  tools  to  evaluate  performance  that  aeronautical 
or  civil  engineers  do."     Although  directed  to  third  generation  computers  and  their 
predecessors,   these  remarks  appear  to  be  applicable  today  because  performance  evaluation  of 
digital computers remains as much an art as science [1,2]. Assessing the performance of
digital computer systems has received much study. Some of the techniques that have been
applied are simulation methods, analysis of queuing models, benchmark testing, and
monitoring of operational systems. Many of the simulation methods and general purpose
software simulator systems are discussed in [2,3]. Queuing theory and analysis techniques
are discussed by Fuller [1] and by Lipsky and Church [4]. Benchmarks have been used to
compare  and  evaluate  operational  computer  systems  as  well  as  a  means  to  evaluate  proposed 
computer  systems  that  are  modeled  via  simulators   (i.e.,   hardware  description  languages). 

In  contrast  to  the  above,   very  little  work  which  attempts  to  develop  methodologies  for 
evaluating  the  performance  of  digital  signal  processor  architectures  has  been  reported  in 
the  open  technical  literature.     On  the  other  hand,   one  can  find  numerous  computational 
complexity  analyses  of  a  number  of  algorithms  frequently  used  to  implement  digital  signal 
processing  functions.     Most  of  these  analyses  assume  that  a  von  Neumann  type  of  processor 
architecture  will  be  used  for  algorithm  realization  or  implementation.     Until  the  advent  of 
VHSIC  and  VLSI   (and   in  a  few  instances,   LSI)   this  has  been  a  realistic  assumption.  Its 
validity  for  future  complex  digital  signal  processors  can  be  questioned  because  of  signal 
processor  architecture  changes  that  have  begun  to  appear  and  are  sure  to  accelerate  in  the 
future. For example, it seems reasonable to assume that VHSIC and VLSI approaches to system
design will be capable of providing greatly enhanced computational capabilities of a general
nature that will aid in solving many present and future complex digital signal processing
problems.

There  are  two  basic  approaches  for  enhancing  speed  performance  of  a  data  processing 
system:      (i)    increase  the  speed  of  operation  of  the  processing  elements,   and   (ii)  introduce 
concurrency,  where  appropriate,   by  employing  multiple  processors  operating  in  parallel. 
VHSIC  addresses  both  of  the  above  items  and  can  be  expected  to  have  a  significant  impact. 
Thus,   system  designers  will  be  confronted  with  a  class  of  highly  complex  problems  mentioned 
in   [5].     That  is,   how  does  one  effectively  map  computational  algorithms  onto  future 
low-cost,   large-scale  integrated  processing  structures?     Also,   in  implementing  a  given 
calculation,   how  close  have  we  come  to  attaining  the  possible  concurrency  inherent  in  the 
computation?     The  goal  of  signal  processor  architecture  performance  assessment  should  be  to 
effectively  address  these  questions  and  hopefully  to  provide  quantitative  performance  data 
which will aid systems designers and implementers.

2.0  The assessment problem and approaches of attack


Ultimately  the  performance  assessment  of  digital  signal  processing  system  architectures 
involves  questions  relating  to  effective  means  for  describing  possibly  complex  systems  and 
estimating  or  measuring  their  effectiveness  in  carrying  out  prescribed  tasks.     Given  a 
computational  requirement,   clearly  two  important  functions  are:     1)   develop  a  schedule  for 
the  events  required  to  realize  the  algorithm(s)   and  2)   determine  the  effectiveness  with 
which  the  allocated  hardware  is  used.     At  the  present  time,   a  major  contribution  to  the 
inherent  complexity  of  the  assessment  problem  is  our  limited  ability  to  effectively  deal 
analytically  with  large  amounts  of  detailed  information.*     Most  systems  capable  of 
performing  significant  tasks,   if  they  are  described  in  sufficient  detail,   ultimately  become 
large-scale  systems  problems.     There  exists  a  fundamental  tradeoff  between  the  level  of 
detail  of  the  desired  performance  information  and  the  complexity  of  representation  (or 
instrumentation)    (i.e.,   the  model)    used  in  the  problem  analysis.     If  very  precise 
information  regarding  performance  features  is  required,   then  a  complex 
formulation/implementation and data analysis problem is likely to result.

Given  this  possible  confrontation  with  complexity  of  representation,    implementation,  and 
data  analysis,    there  are  at  least  three  broad  approaches  which  might  be  taken  in  order  to 
attack  performance  assessment  problems  of  systems.     These  are  briefly  summarized  as 
follows: 

1)  Analytical:

Formulate a mathematical model of the system and assess its performance features using
techniques from applied mathematics.

2)  Construct and test:

Design, construct and experimentally test a prototype of the system under consideration.
Exercise the system and assess performance by analysis of experimental data.

3)  System model simulation and/or analysis:

Develop a software-based system model in an appropriate computer language. The system is
not physically constructed. Performance assessment is based upon the system's response to
"experimental" conditions as simulated on a computer. Alternately, a model of some
specified processing requirement can be analyzed using computer-aided methods.

While  conceptually  attractive  and  highly  desirable,    item  1)   above  is  not  of  great 
practical  interest  now  because  it  appears  only  rarely  applicable  to  the  performance 
assessment  problem.     At  the  present  time,   computer  scientists  and  engineers  are  intensively 
pursuing  the  topic  of  computational  complexity  theory  which  is  becoming  more  viable  at  the 
algorithmic  level.     A  number  of  models  and  performance  measures  have  been  developed  and  used 
for  some  specialized  structures  and  algorithms   [6,7,8,9,10].     Furthermore,    it  is  not,  at 
present,   feasible  to  consider  this  approach  whenever  the  level  of  detail  of  system 
description  becomes  high   (except  possibly  in  the  case  of  simple  systems).     Thus,  despite  its 
potential  for  future  use,    the  analytic  approach  is  not  now  sufficiently  robust  to  form  a 
basis  for  processor  performance  assessment.     An  example  which  illustrates  application  of 
this  approach  will  be  given  in  the  next  section. 

Performance assessment methods based upon the construct and test approach, item 2) above,
have been used in the past primarily for two reasons:

a)  More viable alternatives were not available before simulation techniques were
developed.

b)  Very  reliable  and  useful  performance  assessment  information  can  be  obtained  from  a 
carefully  implemented  experiment  design  and  test  data  analysis. 

Major  drawbacks  of  this  method  are  its  possibly  high  cost  and  lack  of  flexibility  (i.e., 
each  assessment  problem  is  treated  essentially  as  a  new  problem.) 

System model simulation and/or analysis, item 3) above, is broadly applicable to digital
systems. It provides flexibility with respect to model changes induced by design or hardware
changes, and it can be more cost effective than other approaches. Furthermore, its
capability can be expected to improve with the passage of time (although it seems reasonable
to expect improvements in 1) and 2) also). For these reasons, the system model
simulation and/or analysis approach is considered to be the most viable, at present and in
the reasonably near future, of the three approaches considered here. Nevertheless, it
should be emphasized that the designated approach can entail a great deal of work. It can
be expensive in terms of both dollars and time, and it might not always provide all of the
performance information available from other approaches.

*Exceptions exist of course; for example, see [6].

3.0  Examples of approaches

In  this  section  the  performance  assessment  approaches  introduced  above  will  be  either 
illustrated  or  discussed.  While  more  detail  would  certainly  be  desirable,  space  economy 
requires  brevity. 

3.1  Analytical

The  example  chosen  for  discussion  here  makes  use  of  results  developed  from  a  complexity 
analysis  of  a  model  by  Tompa   [9].     In  his  work,  Tompa  considered  two  critical  resource 
requirements  necessary  to  perform  the  matrix  vector  product 

y  =  Ax 

where  x  and  y  are  n-vectors  and  A  is  an  n-by-n  matrix.     Bounds  on  the  required  number  of 
time  steps,  T,   as  a  function  of  available  memory  space,   S,   and  matrix  size,   n,  were 
developed  using  the  so-called  pebbling  game  model,    [9  and  references  contained  therein]. 
The  result  obtained  is 

$$
T \ge \frac{n(n-S)}{S+1}
$$

which  clearly  shows  that  for  fixed  n,  time  can  be  traded-off  with  memory  space,  S.  The 
preceding  result  can  be  given  many  interpretations,  but  two  of  particular  interest  are: 

(i)   Find S to minimize T. The result is S ≈ n.

(ii)  Find S to minimize ST. The result is ST ≈ n/2.

These  results  suggest  that  if  they  can  be  found,   there  exist  two  algorithms  which  minimize 
critical  resource  requirements  as  given  by  (i)   and  (ii).     This  approach  to  assessment  has 
been  applied  to  the  performance  analysis  of  linear  systolic  array  matrix-vector  processors 
[11,12]    for  which  cases   (i)   and   (ii)   above  have  been  identified  as  being  applicable.  The 
preferred  algorithm   (i.e.,    (i)   or   (ii))   depends  upon  the  structure  of  matrix  A.     Details  of 
the  comparison  are  given  in   [11,12]    and  will  not  be  repeated  here. 

3.2  Construct and Test

No  example  will  be  given  in  this  case;   however,    it  is  important  to  note  that  in  any 
serious  undertaking  of  system  implementation,   this  assessment  method  will  ultimately  be 
applied  as  a  means  of  verifying  whether  or  not  performance  goals  have  been  achieved.  Other, 
more  economical  means  are  desired  to  provide  information  upon  which  to  base  time  critical 
design  decisions  in  the  earlier  phases  of  system  conception  and  design.     However,  this 
method  provides  the  ultimate  means  for  verifying  that  system  performance  requirements  have 
been  met. 


3.3  Simulation/Analysis

The  example  chosen  for   illustrating  application  of  the  simulation/analysis  approach  was 
used  to  evaluate  and  compare  the  performance  of  a  number  of  candidate  computer  systems 
[13,14].     In  order  to  perform  the  evaluation  it  was  necessary  to  model  the  candidate 
computers,  select  a  set  of  benchmark  algorithms  for  exercising  the  models,  encode  benchmark 
programs  and  measure  computer  system  efficiency  by  evaluating  performance  under  execution  of 
the  benchmarks.     Statistical  experiment  design  methods  were  used  to  assign  the  algorithms  to 
programmer  teams  and  to  evaluate  the  simulated  results  in  order  to  rank  the  architectures  of 
the  systems  considered. 

The  Instruction  Set  Processor   (ISP),   a  hardware  description  language   (HDL),  was  used  to 
model  each  candidate  computer.     Given  these  HDL  models  of  the  computers,   assembly  language 
benchmark  programs  were  written  and  "executed"  on  the  appropriate  computer  model.  Analysis 
and  evaluation  of  test  data  collected  from  the  simulations  formed  the  basis  for  performance 
comparison. 

Several important points can be drawn from the preceding brief remarks. First, a robust
HDL is required in order to accommodate the possibly broad range of systems for which
performance assessment might be required. Not only must candidate systems be modeled, but
the complete set of benchmark programs for each machine must be programmed and the
associated system response "measured." Performance comparison is based upon analysis of the
measured simulated results. These are not trivial tasks to be lightly undertaken.


4.0  Performance measures for signal processor assessment

In  the  course  of  conducting  a  quantitative  assessment  of  candidate  algorithms  and/or 
signal  processor  architectures  the  need  for  performance  measures  arises.     Each  of  the 
measures  discussed   in  this  section  quantifies  a  basic  system-related  resource  or  expresses  a 
functional  relationship  between  a  number  of  such  resources.     A  central  premise  of  any 
suggested  performance  measure  is  the  availability  of  information  required  for  its 
application.     In  the  general  case,   such  information  might  be  available  from  any  one  of  the 
three  broad  assessment  approaches  presented  in  section  3.     However,  whatever  the  source  of 
data  might  be,   so  long  as  the  data  are  valid,    the  measures  can  be  applied.     Thus  the 
performance  measures  discussed  here  are  in  principle  generally  applicable.     Nevertheless,  it 
is  important  to  note  that  performance  measures  are  ultimately  selected  on  the  basis  of 
subjective  reasoning.     The  measures  presented  below  do  not  constitute  an  exhaustive  list; 
however,   location  of  additional  measures  is  a  topic  of  continuing  interest  and  pursuit. 

4.1  Classical performance measures

The  measures  listed   in  this  subsection  have  been  applied  to  computers  and  digital  systems 
for  a  sufficiently  long  time  and  with  sufficient  frequency  to  be  classified  as  classical 
measures.     Some  of  the  most  frequently  used  classical  measures  are: 

o     Addition  time 

o     Multiplication  time 

o     Memory  cycle  time 

o     System  throughput  rate 

o     Various  data  transfer  rates  and/or  subsystem  bandwidths. 

These  measures  are  of  continuing  interest  to  performance  assessment  because   (i)  they 
indicate  the  state  of  various  technologies  involved  in  system  realization  and   (ii)  they 
succinctly  quantify  important  information  sometimes  needed  for  assessing  algorithm  execution 
time.     However,   if  one  wishes  to  evaluate  system  performance  in  terms  of  structural 
efficiency  (without  regard  to  current  technology)   these  measures  lose  some  of  their  appeal. 

4.2  Performance measures of the computer family architecture (CFA)

The three performance measures presented here were used in a study which compared several
von Neumann computer architectures [13,14] mentioned in 3.3, above. They are noteworthy
since they enable architectures to be compared independent of the technology base used for
their implementation and because they are applicable to programmable systems. These
measures are defined as follows:

Designation    Description of measure

S-measure      Test program (benchmark) size

M-measure      Memory activity - the number of bytes read from or written
               to memory during execution time of test program

R-measure      Number of processor cycles required for the execution of a
               specific test program [see 14].

4.3  Resource requirements and efficiency of utilization

Performance  measures  which  quantify  resource  requirements  and  the  efficiency  with  which 
they  are  used  in  a  given  digital  processor  are  of  interest.     Two  resources  of  frequent 
interest  are  time,  T,   and  generalized  space,   S   (e.g.,  memory  requirements,   number  of 
processors,   or  number  of  instructions  required  to  code  an  algorithm).     Time  measure,  T, 
generally  represents  the  number  of  time  steps  required  to  evaluate  an  algorithm. 

In  the  interest  of  efficient  utilization  of  functional  subsystems,   consider  the 
definition 

$$
\eta = \frac{\text{Number of process time steps that subsystem performs useful work}}{\text{Total number of time steps required to complete the process}}
$$

Using the above definitions of S, T, and $\eta$, consider the combined figure of merit

$$
F = \eta/(ST).
$$

The above relation for F expresses the desire to maximize $\eta$ while minimizing (ST). For
example, if two (denoted by subscripts 1 and 2) different algorithms or digital processors
(or combinations of both) are specified for a given computational requirement, then a
comparison can be made on the basis of

$$
Q = F_1/F_2.
$$

Examples of the use of $\eta$, (ST), F and Q in connection with systolic array processors can be
found in [11].
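
A short numerical sketch shows how these quantities combine. It assumes the reading F = η/(ST) of the figure of merit; the two candidate designs and all of their numbers are hypothetical and are not taken from [11].

```python
def efficiency(useful_steps, total_steps):
    """Utilization efficiency: fraction of time steps spent on useful work."""
    return useful_steps / total_steps


def figure_of_merit(eta, S, T):
    """Combined figure of merit F = eta / (S * T); larger is better."""
    return eta / (S * T)


def compare(F1, F2):
    """Ratio Q = F1 / F2; Q > 1 favours candidate 1."""
    return F1 / F2


# Hypothetical candidate 1: 4 processing elements, 8 time steps, 6 of them useful.
F1 = figure_of_merit(efficiency(6, 8), S=4, T=8)
# Hypothetical candidate 2: 8 processing elements, 4 time steps, 2 of them useful.
F2 = figure_of_merit(efficiency(2, 4), S=8, T=4)
Q = compare(F1, F2)
```

Both candidates have the same space-time product here, so Q is decided entirely by the difference in utilization efficiency.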

4.4  Other measures

Additional  performance  measures  have  been  discussed  by  Sameh  and  Kuck  [15]  who  provide 
definitions appropriate to the use of parallel processors. Furthermore, D'Hollander [16]
defines  and  illustrates  a  number  of  interesting  performance  measures  as  does  Han  [17]. 

Measures  of  performance  represent  an  important,   integral  part  of  any  serious  assessment 
problem.     The  measures  mentioned  above  are  representative  of  those  available  that  might  be 
considered  for  inclusion  into  an  assessment  methodology.     It  is  necessary  to  consider  a 
number  of  performance  measures  since  their  subjective  nature,  coupled  with  our  inability  to 
simultaneously  consider  a  large  number  of  important  problem  parameters,  precludes  the 
identification  of  the  "ultimate"   figure  of  merit. 

5.0  Performance assessment techniques

This  section  summarizes  several  techniques  that  can  be  applied  to  the  digital  signal 
processor  performance  assessment  problem.     HDLs  are  briefly  discussed  since  they  fall  into 
the  model  simulation  approach  discussed  in  Section  3. 

While  not  strictly  simulation  methods,   techniques  which  use  computer-aided  analysis  of 
network  models  are  also  discussed  in  this  section.     These  models  are  developed  from  a  graph 
representation  of  some  specified  processing  requirement.     Analysis  of  the  associated  network 
model  enables  one  to  assess  performance  of  the  signal  processor  elements. 

5.1  Hardware description languages (HDLs)

Digital  hardware  designers  use  HDLs  as  a  means  of  describing  systems  that  are  of 
interest.     These  languages  are  used  in  ways  analogous  to  those  of  high  level  languages  in 
the  course  of  algorithm  implementation  on  a  general  purpose  computer.     Just  as  a  high  level 
computer  language  provides  aids  for  the  programmer,  HDLs  ideally  should  provide  the  user 
with  the  capability  for   (i)   precise  yet  concise  system  description,    (ii)   digital  system 
documentation  (test  generation,  user's  manuals,  etc.),    (iii)  computer  simulation  of  the 
described  system  with  the  capability  for  gathering  important  "operational"  data  as  the 
simulation proceeds, and (iv) incorporation of hardware design changes. In connection with
(i) above, the hierarchical level at which an HDL can describe a system is very important.
A digital system can be described at the following levels of detail: (i) algorithm,
(ii) processor memory switches (PMS), (iii) instruction level, (iv) register transfer level
(RTL), (v) logical variable level, and (vi) gate (i.e., circuit) level.

References   [18]    and    [19]   provide  tutorial  and  survey  information  on  HDLs.     Numerous  HDLs 
have  been  proposed  and  a  number  have  been  implemented  and  used   in  the  analysis  and  design  of 
digital  systems.     From  this  group,   five  HDLs  were  identified  as  being  potentially  useful  in 
evaluating  the  architectural  performance  of  digital  signal  processors.     Selection  was  made 
on the basis of the following questions: (i) at what level is the system described?, (ii)
can synchronous and asynchronous actions be handled?, (iii) what are the timing
capabilities?, (iv) can the HDL description be simulated?, and (v) does the language
provide hierarchical module description? This comparison resulted in the selection of two
HDLs, ISPS (Instruction Set Processor Specification) and SARA (Systems Architect Assistant),
as being the strongest available candidates for use in the signal processor architecture
performance assessment problem. These two HDLs are currently undergoing evaluation. A
proposed new HDL, VHDL (VHSIC HDL), has grown out of the VHSIC project. This language is
specified in [20], which also contains a description of an alternative to VHDL [20, Appendix
A].

5.2  Analysis of network models

This  approach  to  signal  processor  architecture  assessment  finds  its  origin  in  techniques 
developed  and  used  in  project  management  to  allocate  resources  and  schedule  operations  in 
order  to  effect  "project"   implementation.     Signal  processing  algorithms  are  considered  as 
projects  to  be  implemented.     A   directed  acyclic    graph  models  the  project  considered.  Edges 
of  this  graph  represent  primitive  algorithmic  operations  which  are  performed  in  a  prescribed 
manner.     They  also  serve  to  express  the  precedence  relations  contained  in  the  project 
schedule.     Vertices  of  the  graph  represent  project  events  or  project  milestones.  Assigning 
time duration weights to the edges (associated with the time requirements of the primitive
operations) of the graph results in a project activity network representation [21].
Operations researchers have applied activity network analysis techniques to study the
tradeoffs  between  project  time  requirements  and  resource  requirements.     Such  analyses 
provide  data  which  can  be  used  to  assess  important  aspects  of  the  implementation. 

An  example  of  a  simple  activity  network  is  presented  in  Figure  1.  Primitive  operations 
are  denoted  by  upper  case  letters  (their  associated  time  requirements  are  not  shown  in  this 
example). A convention of this representation enables the expression of precedence
constraints.  That  is,  no  activity  emanating  from  a  given  vertex  can  be  initiated  until  all 
activities  terminating  at  that  vertex  are  completed.  Thus  the  "project  completed"  event  of 
Figure  1  must  be  preceded  by  all  activities  of  the  network.  In  Figure  1,  action  A  precedes 
actions  E  and  D,  etc. 

Analysis of an activity network provides earliest and latest event times corresponding to
the vertices contained in the network model. The earliest event time of the terminal vertex
of the network is called the Critical Path (CP) time; it is the length of the longest directed
path from the beginning vertex to the terminal vertex (measured in units of time required by
the activities located on the CP). This path is important because, under the problem
specification, the project cannot be completed in less time than the CP time.
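
The earliest and latest event times described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `event_times`, the six-activity network, and its durations are hypothetical and are not taken from Figure 1. Edges carry the activities, vertices are events, and each edge is visited a constant number of times.

```python
from collections import defaultdict

def event_times(edges):
    """edges: list of (tail_event, head_event, duration) activities of a DAG.
    Returns (earliest, latest, cp_time) for the events of the network."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, d in edges:
        succ[u].append((v, d))
        indeg[v] += 1
        nodes.update((u, v))

    # Topological order (Kahn's algorithm).
    order = []
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("activity network must be acyclic")

    # Forward pass: an event occurs once all incoming activities are done.
    earliest = {n: 0 for n in nodes}
    for u in order:
        for v, d in succ[u]:
            earliest[v] = max(earliest[v], earliest[u] + d)
    cp_time = max(earliest.values())

    # Backward pass: latest time an event may occur without delaying the end.
    latest = {n: cp_time for n in nodes}
    for u in reversed(order):
        for v, d in succ[u]:
            latest[u] = min(latest[u], latest[v] - d)
    return earliest, latest, cp_time

# Hypothetical network: events 1..5, activities A..F with assumed durations.
activities = [(1, 2, 3), (1, 3, 2), (2, 4, 4), (3, 4, 1), (2, 5, 2), (4, 5, 5)]
earliest, latest, cp_time = event_times(activities)
```

Events whose earliest and latest times coincide lie on the critical path; the difference between the two times at any other event is the slack available for rescheduling.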

Brafman,   et  al.    [22],   recognizing  that  the  signal  flow  graph  representation  of  a  digital 
filter  can  be  given  an  activity  network  interpretation,  show  how  the  CP  method  can  be  used 
to implement digital filters using multiple microprocessors. Their approach has been
generalized by Zeman and Moschytz [23,24] who, in addition to finding the CP, also determine the
latest event times of the various network nodes. Given this information, it is possible to
evaluate the resource tradeoffs available when implementing a given network structure.
Both scheduling and resource requirements properties of a given problem can be studied and
the alternatives evaluated. Work of a somewhat similar nature has been presented by Huang
and Wing [25].

As  an  example  of  this  approach,  consider  the  second  order  synchronous  digital  filter  of 
Figure  2  which  has  been  analyzed  in    [23].     Converting  to  an  activity  network  representation 
and  assuming  that  only  multiplication  and  addition  operations  consume  time   (assume  five 
units  of  time  each)    it  is  easy  to  show  that  the  resource  tradeoffs  plotted  in  Figure  3 
follow. Figure 3a illustrates the case where maximum parallelism is obtained (it is easy to
show that for the example considered, this result is identical with that obtained by
applying Crochiere's node partitioning in order to maximize parallelism [26]). It is
possible to reduce the hardware commitment to this problem (see Figure 3b) without
increasing the CP time of Figure 3a. Only when one adder and one multiplier are used is the
CP time slightly increased, as shown in Figure 3c. Clearly, a number of the performance
metrics introduced earlier can be used to quantify the performance of this digital
processor.

Performance  assessment  of  digital  signal  processors  using  the  activity  network  approach 
discussed  above  is  of  interest  for  a  number  of  reasons: 

(i)     Hardware  resource  requirements  of  algorithms  can  be  quantified  and  compared 
using  performance  measures. 

(ii)     Required  computations  of  specified  algorithms  can  be  scheduled  and  even 
microcoded  for  some  signal  processor  structures  [24,25]. 

(iii)     Failures  leading  to  degraded  modes  of  operation  can  be  investigated.  Their 
effects  and  the  desired  corrective  actions  can  be  studied. 

The  preceding  simple  example  illustrates  some  of  the   ideas  used  in  the  "analysis  of 
activity  networks"  approach  to  the  assessment  problem.     Both  the  analysis  approach  and  the 
results  obtained  are  very  germane  to  the  signal  processor  architecture  assessment  problem. 
Typically, signal processor algorithms are data flow in nature and thus can be described by
directed acyclic graphs. Furthermore, the directed graph representation of an algorithm can be
optimized using Petri net reachability analysis methods [27] before being subjected to
activity  network  analysis  as  described  above. 

A  few  comments  pertaining  to  some  of  the  problems  encountered  in  analyzing  activity 
networks  are  in  order.     Locating  the  critical  path  is  not  overly  time  consuming,  even  for  a 
large network. Even [28] states that the time complexity is $O(|E|)$, where $|E|$ is the number
of activities. This is in contrast to the analysis problem where project planning must be
done under the constraint of limited resources, which is known to be NP-hard [29].
Therefore, heuristic methods are frequently used.
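
One common heuristic for the resource-limited case is list scheduling with a critical-path priority. The sketch below is an assumption-laden illustration, not the method of [23-25]: the function `list_schedule`, the seven-operation task graph, and the unit counts are hypothetical, and the five-time-unit durations simply follow the assumption used in the filter example above. Here operations are attached to graph nodes rather than edges for convenience.

```python
def list_schedule(tasks, units):
    """tasks: name -> (kind, duration, [predecessor names]).
    units: kind -> number of identical hardware units of that kind.
    Greedy list scheduling; returns (start_times, makespan)."""
    succs = {t: [] for t in tasks}
    for t, (_, _, preds) in tasks.items():
        for p in preds:
            succs[p].append(t)

    # Priority of a task: longest path from its start to the end of the graph.
    rank = {}
    def tail(t):
        if t not in rank:
            rank[t] = tasks[t][1] + max((tail(s) for s in succs[t]), default=0)
        return rank[t]
    for t in tasks:
        tail(t)

    start, finish = {}, {}
    busy = []                 # (finish_time, kind) of operations in progress
    free = dict(units)
    time = 0
    while len(start) < len(tasks):
        # Release the units of operations that have finished by now.
        for _, kind in [b for b in busy if b[0] <= time]:
            free[kind] += 1
        busy = [b for b in busy if b[0] > time]
        ready = [t for t in tasks if t not in start
                 and all(p in finish and finish[p] <= time for p in tasks[t][2])]
        # Most critical ready operations get the free units first.
        for t in sorted(ready, key=lambda t: -rank[t]):
            kind, dur, _ = tasks[t]
            if free[kind] > 0:
                free[kind] -= 1
                start[t] = time
                finish[t] = time + dur
                busy.append((time + dur, kind))
        if len(start) < len(tasks):
            time = min(f for f, _ in busy)   # advance to the next completion
    return start, max(finish.values())

# Hypothetical task graph: four products summed by a tree of three additions.
ops = {
    "m1": ("mul", 5, []), "m2": ("mul", 5, []),
    "m3": ("mul", 5, []), "m4": ("mul", 5, []),
    "a1": ("add", 5, ["m1", "m2"]), "a2": ("add", 5, ["m3", "m4"]),
    "a3": ("add", 5, ["a1", "a2"]),
}
_, t_full = list_schedule(ops, {"mul": 4, "add": 2})
_, t_half = list_schedule(ops, {"mul": 2, "add": 1})
_, t_min = list_schedule(ops, {"mul": 1, "add": 1})
```

Running the scheduler over several unit counts gives a time-versus-hardware tradeoff of the kind discussed for Figure 3. Being a greedy heuristic, it does not guarantee an optimal schedule.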
6.0  Conclusions


This  paper  has  considered  several  general  approaches  to  the  digital  signal  processor 
architecture  assessment  problem.     Simulation  methods  and/or  analysis  of  models  based  upon 
activity  networks  appear  to  be  viable  approaches  when  compared  with  the  other  two  methods. 
Analytical  computational  complexity  approaches  to  the  assessment  problem  are  presently  under 
intense  investigation  by  many  researchers.     Because  of  their  potential   impact,   results  from 
such  research  efforts  are  worthy  of  continuous  monitoring  by  those  interested  in  the 
assessment  problem. 

While  the  thrust  of  this  paper  is  directed  only  to  those  computations  that  have 
traditionally  been  used  in  signal  processing  applications,   there  is  reason  to  question  this 
assumption  of  traditionality  with  regard  to  future  signal  processing  applications.  The 
impact  of  VHSIC  and  VLSI  on  the  availability  of  computing  resources  might  cause  dramatic 
changes  with  respect  to  our  concept  of  traditional  signal  processing  and  result  in  the 
acquisition  of  many  computations  of  more  general  nature.     Such  a  trend  would  only  compound 
the  performance  assessment  problem  and  it  underscores  the  need  for  progress  on  this  aspect 
of  the  signal  processing  problem. 
Figure 2. A Second Order Digital Filter.

Figure 3. Arithmetic Operations Versus Time Options for Implementing the Digital Filter of Figure 2 (* = multiplication, + = addition).
Abstract

In this paper we briefly describe the systolic array architecture. We discuss performance
issues that arise in the evaluation of systolic array architectures. We review the
fundamental  concepts  of  Petri  nets  and  consider  their  suitability  as  a  tool  for  the  modeling 
and  analysis  of  systolic  array  architectures.     We  review  known  results  concerning  the  use  of 
timed  decision-free  Petri  nets  for  performance  evaluation  of  computing  systems.     We  propose 
a  new  class  of  Petri  nets  (called  coherent  safety  nets)   that  appear  to  be  useful  for 
performance  evaluation  of  pipelined  signal  processing  architectures.     These  techniques  are 
applied  to  systolic  array  architectures. 

1.0  Introduction

Signal processor architectures have benefited and will continue to benefit from improvements
in implementation technologies. Component densities of integrated circuits have
doubled  approximately  every  two  years.     This  has  spawned  no  less  than  a  revolution  in 
digital  systems  design.     Advances   in  VHSIC  and  VLSI  technologies  have  made  possible  the 
consideration  of  new  architectures  based  upon  the  concurrent  or  parallel  use  of  multiple 
hardware  units.     These  advances  have  also  made  the  construction  of  special  purpose 
processors  an  economically  attractive  option. 

Improvements  (in  the  form  of  physical  integration)    in  implementation  technologies  will 
have  several  profound  effects  on  signal  processor  architecture.     The  increase  in  circuit 
density  results  in  faster  components  which  may  in  turn  result  in  faster  systems.  However, 
the  technologies  in  which  digital  systems  are  implemented  have  demonstrated  relatively  flat 
growth  in  basic  device  speed  when  compared  to  drops  in  the  basic  device  cost  and  size.  The 
real  payoff  of  increasing  scales  of  integration,   therefore,  has  been  the  dramatic  decrease 
in  cost,   size,   and  power  dissipation  of  the  basic  processing  element.     Thus  signal  processor 
architects  can  design  systems  constructed  from  large  numbers  of  processors.     The  motivation 
for  this  is  to  achieve  algorithmic  speedup  by  employing  a  number  of  processors  to  execute  a 
task  in  parallel.     Moreover,   the  size  and  power  improvements  open  up  opportunities  for  new 
applications  of  signal  processing. 

Increased  processing  power  per  dollar  provides  the  signal  processor  architect  with  a 
relative  abundance  of  processors  at  the  system  level.     Whereas  single  processor  systems 
constituted  the  only  economically  feasible  region  of  the  design  space  in  von  Neumann's  day, 
today  systems  are  being  proposed  that  are  predicated  on  the  use  of  thousands  of  processors. 
Systems  based  on  the  use  of  only  a  few  processors  relied  on  a  regime  of  centralized  control. 
Control  and  coordination  of  large  numbers  of  processors  rely  on  some  form  of  cooperative 
anarchy,  known  variously  as  distributed  or  highly  parallel  processing.     The  question  of  how 
to  effectively  utilize  a  large  number  of  processors  in  solving  a  common  problem  is  now 
prominent  in  digital  systems  design. 

The  complexity  of  systems  based  on  the  concurrent,   cooperating  operation  of  a  large 
number  of  processors  highlights  the  necessity  for  acquiring  tools  for  the  analytical 
modeling,   analysis,   and  evaluation  of  these  systems.     In  this  paper  we  will  discuss  the  use 
of  certain  classes  of  Petri  nets  for  the  analysis  of  signal  processor  architectures  based  on 
highly  parallel  systolic  array  organizations. 

We  shall  briefly  describe  the  systolic  array  architecture.     Then  we  will  discuss 
fundamental  concepts  of  Petri  net  modeling,   particularly  as  related  to  the  performance 
analysis  of  systolic  arrays.     We  conclude  the  paper  with  a  discussion  of  the  role  of  Petri 
net  modeling   in  signal  processing  systems. 

2.0  Systolic Array Architectures

The introduction of the systolic array for performing matrix and signal processing computations
was largely motivated by the desire to promote architectures that conform well to
constraints imposed by VLSI technologies [1]. The systolic array is a special purpose
architecture capable of executing a large class of signal processing and matrix algorithms.
The systolic array is a regular, expandable, interconnected array of very simple processors.
Each processor (or cell) is connected with a small number of neighboring cells, and all data
come  from  or  go  to  its  neighbors.     As  originally  defined,  the  systolic  array  operates  in  a 
systolic  mode,  which  means  that  each  cell  performs  its  preprogrammed  function,  passes  its 
results  to  its  data  consuming  neighbors  and  does  not  perform  the  function  again  until  it 
receives  input  from  its  data  producing  neighbors.     Thus  each  stage  of  the  systolic  array's 
operation  is  completely  determined  by  the  state  of  a  local  neighborhood  rather  than  being 
dependent  on  global  state  information.     When  the  cells  are  all  synchronized,   this  data  flow 
pattern  "pumps"  data  through  the  array  —  hence  the  name  systolic  array. 

Another important feature of the systolic array concept is the innovative use of pipelining.
Pipelining has been used in numerous architectures as a means of improving performance.
It   is  particularly  fruitful   in  signal  processing,  where  high  throughput  is  required.  The 
pipeline  is  analogous  to  an  assembly  line:     a  problem  is  broken  up  into  a  number  of  subtasks 
to  be  sequentially  performed,   and  these  subtasks  are  concurrently  performed  on  a  number  of 
problems.     When  a  subtask  is  completed  for  a  given  problem,   the  problem  is  passed  on  for  the 
succeeding  subtask  to  be  performed  and  the  problem  from  the  preceding  subtask  is  accepted. 
This  allows  many  problems  to  be  solved  simultaneously,    thus  achieving  a  high  throughput 
rate. The insight of [1] was to realize that certain problems are well suited not only to
pipelining, but also to multipipelining, i.e., the use of several intersecting pipelines that
ensure that the data of the problem automatically and simultaneously arrive at the subtask
where they can be used. This clever set-up allows all data and results to be at the right
place  at  the  right  time.     Another  important  distinction  between  systolic  and  conventional 
pipelines  is  that  systolic  architectures  pipeline  both  (partial)   results  and  input  data, 
whereas  conventional  architectures  pipeline  only   (partial)  results. 

Systolic arrays have been proposed to solve a number of problems: matrix-vector multiplication,
matrix-matrix multiplication, LU decomposition [1], convolution, FIR and IIR
filtering, Fourier transform [2], least squares problems [3], and even graph problems [4].
The systolic array allows one to achieve considerable speedup. For the problems to which it
has been adapted, it typically achieves from linear to quadratic speedup in execution time over
conventional, sequential algorithms for these problems.

Another  advantage  of  the  systolic  array  is  that  it  appears  to  be  extremely  well  suited  to 
VHSIC  or  VLSI  implementation.     The  fact  that  its  basic  cells  are  very  simple  means  that  it 
would  be  straightforward  to  design  the  basic  cell  and  automatically  replicate  it  across  the 
chip.     The  planar  topologies  of  the  systolic  array  (see  Figure  1)   are  also  straightforward 
to  lay  out  in  silicon.     The  fact  that  cells  are  connected  in  a  nearest-neighbor  pattern  is 
also propitious for VHSIC or VLSI implementation, since this allows simple routing of
communication  wires.     This  last  point  is  significant  in  light  of  the  observed  growth  of 
communications  related  costs  in  integrated  circuit  design  [5]. 

3.0  Petri Net Analysis of Systolic Architectures

Petri nets were originally introduced for the study of asynchronous systems [6]. A Petri
net  is  an  abstract  mathematical  entity,  and  as  such  can  be  used  to  formally  model  certain 
attributes  of  real-life  systems.     The  theory  of  Petri  nets  has  been  developed  extensively  so 
that  properties  of  certain  classes  of  Petri  nets  are  well-understood.     It  is  therefore 
possible  to  use  Petri  nets  as  a  tool  for  the  formal  mathematical  analysis  of  system 
behavior.     Unlike  many  classes  of  abstract  automata,   however,   Petri  nets  have  been  used  to 
model  and   investigate  real  machines,   e.g.,   the  CDC  6400    [7],   the  CDC  6600    [8],   and  several 
generic  multiprocessing  systems   [9].     An  extension  of  the  Petri  net,  the  UCLA  graph,  has 
even been used as the basis for the SARA digital systems design methodology [10].

A  Petri  net  is  a  special  type  of  directed  graph  in  which  the  vertices  are  of  two  types: 
places and transitions. Graphically, places are represented by circles and transitions by
bars.     The  arcs  of  a  Petri  net  may  go  from  places  to  transitions  or  from  transitions  to 
places.     If  an  arc  goes  from  a  place  to  a  transition,   then  that  place  is  called  an  input 
place  of  the  transition.     If  an  arc  goes  from  a  transition  to  a  place,   then  the  place  is 
called  an  output  place  of  the  transition. 

In  order  to  study  the  behavior  of  a  system  it  is  necessary  to  consider  how  its  Petri  net 
makes  transitions  from  one  state  to  the  next.     For  reasons  that  will  become  apparent  later, 
we wish to associate times with each transition of a Petri net. To each transition $t_i$ of
a Petri net we assign a random variable $X_i$ with expected value $E[X_i] = T_i$, which is
intended to describe the time required to make that transition. Such a Petri net is termed
a timed Petri net. Transitions are controlled by markings. A marking of a Petri net is an
assignment of a non-negative integer to each of its places. In a graphical representation a
marking of the Petri net is indicated by placing dots or tokens in the places of the net.
Transition $t_i$ is said to be enabled to fire when each of its input places contains at
least one (uncommitted) token. When this is the case, transition $t_i$ stays enabled (or
fires) for $T_i$ time units and its enabling tokens stay committed to $t_i$ for this time
period, whereafter each of these tokens is removed from $t_i$'s input places and one token is
placed in each of $t_i$'s output places. Once a token has enabled a transition it stays
committed to that transition until the transition time expires, and it cannot enable another
transition while it is committed. Moreover, a transition cannot be multiply enabled, i.e.,
at any instant at most one token from a transition's input place can be committed to that
transition. To illustrate these concepts consider the example shown in Figure 2. The
places of this Petri net are labeled $p_1$ through $p_4$ and the transitions have firing times
$T_1 = 2$, $T_2 = 1$ and $T_3 = 3$. Transition $t_1$ has input places $p_1$ and $p_2$ and output
place $p_1$. Similarly, $t_2$ has $p_2$ as an input place and $p_3$ and $p_4$ as output places.
Places $p_3$ and $p_4$ are $t_3$'s input places and $p_1$ and $p_2$ its output places. Let
$P = (p_1, p_2, p_3, p_4)$ be a vector of numbers corresponding to the number of tokens in
each place of the net. Initially $P = (1, 2, 0, 0)$ with transitions $t_1$ and $t_2$
simultaneously enabled. At time 1 the state is $P = (1, 1, 1, 1)$ with transitions $t_1$ and
$t_3$ enabled. At time 2, $P = (1, 0, 1, 1)$ and at time 4, $P = (2, 1, 0, 0)$. At time 4, there
is a conflict: either transition $t_1$ or $t_2$ may be enabled, but not both. If $t_1$ is
enabled, then at time 6 the net is in state $P = (2, 0, 0, 0)$ and remains there.
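
The firing rule just described can be traced mechanically. The following Python sketch is a minimal illustration under several assumptions: the function name `simulate` is hypothetical, firing times are taken as constants equal to their expected values, conflicts are resolved in favor of the lowest-numbered transition, and the arc lists are those given in the text for the net of Figure 2.

```python
def simulate(pre, post, delay, marking, horizon=100):
    """Non-overlapped timed Petri net with committed tokens.
    pre/post: transition -> list of input/output places.
    delay: transition -> (constant) firing time.
    Returns a list of (time, marking) pairs, one per completion instant."""
    marking = dict(marking)
    committed = {p: 0 for p in marking}   # tokens reserved by firing transitions
    running = {}                          # transition -> completion time
    time = 0
    trace = [(time, dict(marking))]
    while True:
        # Enable every idle transition whose input places each hold an
        # uncommitted token (ties broken by transition name: an assumption).
        for t in sorted(pre):
            if t not in running and all(marking[p] > committed[p] for p in pre[t]):
                for p in pre[t]:
                    committed[p] += 1
                running[t] = time + delay[t]
        if not running:
            break                         # the marking has terminated
        time = min(running.values())
        if time > horizon:
            break
        for t in [t for t, done in running.items() if done == time]:
            del running[t]
            for p in pre[t]:              # consume the committed tokens
                committed[p] -= 1
                marking[p] -= 1
            for p in post[t]:             # deposit one token per output place
                marking[p] += 1
        trace.append((time, dict(marking)))
    return trace

pre = {"t1": ["p1", "p2"], "t2": ["p2"], "t3": ["p3", "p4"]}
post = {"t1": ["p1"], "t2": ["p3", "p4"], "t3": ["p1", "p2"]}
delay = {"t1": 2, "t2": 1, "t3": 3}
trace = simulate(pre, post, delay, {"p1": 1, "p2": 2, "p3": 0, "p4": 0})
final_time, final_marking = trace[-1]     # time 6, marking (2, 0, 0, 0)
```

With the tie-breaking rule assumed here, the conflict between $t_1$ and $t_2$ is resolved in favor of $t_1$, which is the case followed in the text; a different rule would yield a different trace.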

Not all timed Petri nets operate as above. In the scheme we have described all uncommitted
tokens form a first-come-first-served queue at a place. A transition can be enabled to
fire  by  at  most  one  single  token  per  place.     Not  all  interpretations  of  timed  Petri  nets 
make  this  assumption.     In   [11]   two  or  more  tokens  at  a  single  place  can  enable  a  transition. 
This  means  that  a  transition  can  have  overlapped  firings,  which   is  somewhat  similar  to 
pipelining  of  the  transition.     In  this  paper  we  intend  to  model  systems  at  a  very  low  level. 
Assuming  that  transitions  represent  low  level  computational  elements,    it   is  hard  to  justify 
the  use  of  overlapped  firings.     Therefore  we  will  concentrate  on  the  non-overlapped  firing 
interpretation. 

The  above  description  of  Petri  nets  and  their  operation  is  somewhat  abstract.     It  is 
therefore necessary to relate Petri net concepts to physical systems. A Petri net represents
a system of interconnected computational elements. Computational elements can be
viewed as hardware elements such as adders, multipliers, or comparators, and they are represented
by transitions in a Petri net. Computational elements operate on input operands and
produce  results.     The  time  associated  with  a  transition  represents  the  amount  of  time  that 
the  computation  requires.     As  we  mentioned  earlier,   times  associated  with  transitions  are 
random  variables.     This  was  done  to  capture  the  tendency  of  hardware  to  exhibit  variability 
in  performance.     For  example,  the  time  taken  by  an  operation  can  be  influenced  by  its 
operands   (e.g.,  magnitude)  or  fluctuations  in  hardware   (e.g.,  clock  drift).     Places  and 
tokens  of  the  Petri  net  represent  conditions.     For  instance,   the  presence  of  tokens  in  the 
input places of the Petri net fragment in Figure 3 indicates that the input latches of the
multiplier  are  full  and  the  output  buffer  is  empty;   the  multiplier  is  then  ready  to  begin 
computation. 

Much of the power of Petri nets comes from their ability to realistically model parallelism,
contention, and sequencing. Contention in a Petri net occurs when two transitions
compete for the same token. Contention in pipelined systems does not appear to present
problems, so we will make simplifying assumptions that eliminate the occurrence of contention
in the Petri nets that we discuss. We will exclusively consider a class of Petri nets
known as strongly connected decision-free Petri nets. A Petri net is strongly connected if
there is a sequence of places and transitions $p_1 t_1 p_2 t_2 \cdots p_{k-1} t_{k-1} p_k$ where the $p$'s are
places and the $t$'s are transitions and $t_i$ has input place $p_i$ and output place $p_{i+1}$
for $1 \le i < k$. A Petri net is decision-free if each place has exactly one incoming arc and one
outgoing  arc.     Decision-free  Petri  nets  are  related  to  another  well  known  graphical  modeling 
tool:     marked  graphs,  which  are  directed  graphs  that  carry  tokens  on  their  arcs.  A 
decision-free  Petri  net  can  be  converted  to  a  marked  graph  by  transforming  the  Petri  net's 
transitions  to  nodes  and   its  places  to  arcs.     A  marked  decision-free  Petri  net  and  its 
corresponding  marked  graph  are  shown  in  Figure  4.     An   important  property  of  decision-free 
Petri  nets   is  that  a  marking  will  either  terminate   (i.e.,   eventually  go  into  some  final 
state)   or  else  it  will  end  up  in  a  cyclically  repeating  sequence  of  states.     In  this  latter 
case the net is said to have a non-terminating marking.
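
Both the decision-free test and the conversion to a marked graph are simple to express in code. The sketch below is illustrative only: the function names and the two-transition ring are hypothetical, and places without any arcs are ignored.

```python
from collections import Counter

def is_decision_free(pre, post):
    """pre[t] / post[t]: input / output places of transition t.
    True when every place has exactly one incoming and one outgoing arc."""
    outgoing = Counter(p for t in pre for p in pre[t])    # place -> transition
    incoming = Counter(p for t in post for p in post[t])  # transition -> place
    places = set(outgoing) | set(incoming)
    return all(outgoing[p] == 1 and incoming[p] == 1 for p in places)

def to_marked_graph(pre, post, marking):
    """Turn a decision-free net into a marked graph: transitions become
    nodes, each place becomes one arc carrying that place's tokens."""
    if not is_decision_free(pre, post):
        raise ValueError("net is not decision-free")
    producer = {p: t for t in post for p in post[t]}
    consumer = {p: t for t in pre for p in pre[t]}
    return [(producer[p], consumer[p], marking.get(p, 0)) for p in sorted(producer)]

# Hypothetical two-stage ring: t1 -> pA -> t2 -> pB -> t1, one token on pB.
ring_pre = {"t1": ["pB"], "t2": ["pA"]}
ring_post = {"t1": ["pA"], "t2": ["pB"]}
arcs = to_marked_graph(ring_pre, ring_post, {"pA": 0, "pB": 1})

# A net in which one place feeds two transitions is not decision-free.
shared_pre = {"t1": ["p1", "p2"], "t2": ["p2"], "t3": ["p3", "p4"]}
shared_post = {"t1": ["p1"], "t2": ["p3", "p4"], "t3": ["p1", "p2"]}
has_decision = not is_decision_free(shared_pre, shared_post)
```

Each tuple in `arcs` is (source transition, destination transition, token count), which is the usual way of writing down a marked graph.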

A  widely  used  performance  measure  for  signal  processing  is  throughput.     Throughput  in  a 
pipelined  system  is  reflected   in  how  frequently  the  system  can  produce  results.  For 
instance,   the  throughput  of  an  automotive  assembly  line   is  given  by  the  number  of  cars 
produced  per  day.     We  are  not  specifically  concerned  with  pipeline  startup   (latency)  time, 
although  this  can  have  an  impact  on  the  performance  of  some  systems;    the  penalty  of  startup 
time  can  be  amortized  over  long  pipelined  computations.     In  order  to  evaluate  throughput  in 
a system modeled by a Petri net we consider the concept of transition cycle time [11]. Let
$S_i(n)$ be the time at which transition $t_i$ initiates its n-th firing; the cycle time of
$t_i$ is

$$C_i = \lim_{n \to \infty} \frac{S_i(n)}{n}.$$

The cycle time $C_i$ is the average interfiring time for transition $t_i$. It is inversely
related to the rate at which transition $t_i$ fires. In our Petri net model (non-overlapped
transition firing) the cycle time for transition $t_i$ must be greater than or equal to $T_i$,
the expected firing time of $t_i$. In the Petri net model of [11] (overlapped transition
firing) the cycle time of a transition can be less than the transition's firing time but
must be greater than zero. The quantity $T_i/C_i$ represents the utilization of transition
$t_i$ (the fraction of time that the transition is enabled to fire).

Another  performance  measure  that  is  related  to  throughput  is  bottleneck.     A  bottleneck  in 
a  Petri  net  (or  the  system  it  models)    is  the  maximally  utilized  transition  of  the  net. 
Identification  of  system  bottlenecks  can  be  an  arduous  task.     Fortunately  this  is  not  so  for 
systems  modeled  by  decision-free  Petri  nets  --  the  system  bottleneck  is  the  transition  with 
the  highest  expected  execution  time.     This  property  of  decision-free  Petri  nets  is  expressed 
in the following theorem [11].

Theorem: In a decision-free Petri net (under a non-terminating marking) with transitions
$t_1, t_2, \ldots, t_n$, let $C_i$ denote the cycle time of transition $t_i$. Then $C_i = C_j$
for $1 \le i, j \le n$.

Thus in a decision-free Petri net (with a given non-terminating marking) all transitions
have  the  same  cycle  time  C.     This  result  is  significant  and  rather  counter-intuitive  when 
interpreted  in  the  context  of  pipelined  systems.     Recall  the  hexagonally  connected  systolic 
array  of  Figure  1.     The  original  description  of  this  array  was   in  terms  of  basic  cells  whose 
operation  was  entirely  synchronized.     All  cells  take  a  fixed  amount  of  time  to  compute  and 
communicate  their  results.     Synchronized  control  works  fine  when  all  basic  cells  have 
similar  performance.     However,   if  some  of  the  basic  cells  are  substantially  slower  than 
others,   then  the  slow  cells  constitute  system  bottlenecks  because  the  entire  array  must  be 
synchronized  to  the  speed  of  the  slowest  cell.     A  VHSIC  or  VLSI   implementation  of  the 
systolic  array  would  have  several  basic  cells  on  a  single  chip,  with  the  system  further 
composed of additional chips and boards. The basic cells are of two types: peripheral
cells and internal cells. Peripheral cells are located at the boundary of a module (e.g.,
the periphery of a chip), and internal cells are located entirely in the interior of a
module.     Peripheral  cells  tend  to  be  bottlenecks.     For  example,   cells  at  a  chip's  periphery 
incur  a  much  greater  delay  in  driving  a  signal  to  its  neighboring  cells  on  another  chip  than 
do  cells  that  communicate  only  with  cells  on  the  same  chip.     This  interchip  communication 
delay  can  dominate  intrachip  communication  delay  by  an  order  of  magnitude.     Another  prime 
candidate  for  bottleneck  is  the  cells  that  interface  directly  with  input  or  output  devices. 
These cells are limited by input or output device speed or by memory cycle time. Asynchronously
controlled pipelining has been suggested as a method for circumventing the bottlenecks
of a synchronously controlled pipeline [12,13]. The Petri net fragment of Figure 5
models  a  hexagonally  connected  asynchronous  systolic  array.     The  whole  Petri  net  is 
decision-free,  and  thus  by  the  above  theorem  all  transitions  have  the  same  cycle  time. 
Except  for  a  slight  advantage  in  pipe  startup  time  the  asynchronous  systolic  array  will 
perform  no  faster  than  the  synchronous  systolic  array  --  both  are  limited  by  the  speed  of 
the  slowest  basic  cell.     In  this  comparison  we  are  assuming  that  the  synchronization  scheme 
is  capable  of  making  all  cells  accurately  cycle  at  the  rate  of  the  slowest  cell.     We  are 
also  assuming  that  the  speed  of  a  basic  cell   (i.e.,   computation  and  communication  delay)  is 
the same in both synchronous and asynchronous implementations. Issues such as the reliability
of a global clocking scheme, the accuracy of a global clock, and the overhead
introduced  by  either  the  synchronous  or  asynchronous  system  have  been  ignored  by  this 
analysis  —  these  issues  are  beyond  our  scope. 

The  previous  theorem  tells  us  that  eventually  all  transitions  arrive  at  a  common  basal 
cycle  time,   but  gives  no  indication  of  what  that  cycle  time   is.      It  would  be  much  more 
useful  to  know  the  exact  cycle  time  of  the  system  being  modeled.     It  is  possible  to 
calculate  the  exact  cycle  time  of  a  system  solely  on  the  basis  of   its  Petri  net's  structural 
characteristics.     This  result  is  the  following  theorem,   essentially  due  to  [11]. 

Theorem: In a decision-free Petri net (under a non-terminating marking) that has
transitions $t_1, t_2, \ldots, t_n$ with firing times $T_1, T_2, \ldots, T_n$ and places
$p_1, p_2, \ldots, p_m$ with initial token counts $P_1, P_2, \ldots, P_m$, the cycle time $C$ is given by

$$C = \max\left\{ \frac{\sum_{t_i \in L_k} T_i}{\sum_{p_i \in L_k} P_i} : 1 \le k \le q \right\}$$

where

$\sum_{t_i \in L_k} T_i$ = sum of the execution times of the transitions in circuit k
$\sum_{p_i \in L_k} P_i$ = sum of the tokens in the places in circuit k
$q$ = number of circuits in the net
$L_k$ = loop (circuit) k.

This  theorem  allows  one  to  directly  calculate  system  performance  by  enumerating  the 
performance quotient (transition times divided by token loading) of each circuit and choosing
the maximum. Finding all circuits of a directed graph can be computationally burdensome
[14]. Fortunately, [11] has given an efficient procedure for verifying that a system's
cycle time falls short of, meets, or exceeds a given performance requirement (but does not
indicate the exact cycle time).
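
The direct calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of [11]: the net is given in its marked graph form, the names `simple_circuits` and `cycle_time` are hypothetical, each ordered pair of transitions is joined by at most one place, and circuits are enumerated by brute force, which is practical only for small nets.

```python
from fractions import Fraction


def simple_circuits(succ):
    """Yield every simple circuit once, as a list of nodes.

    succ maps each node to a list of successor nodes. Node names must be
    mutually comparable (e.g. all strings) so that each circuit can be
    rooted at its smallest node and reported exactly once.
    """
    for start in sorted(succ):
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            for nxt in succ[node]:
                if nxt == start:
                    yield path
                elif nxt > start and nxt not in path:
                    stack.append((nxt, path + [nxt]))


def cycle_time(firing_time, tokens):
    """Maximum over all circuits of (sum of firing times) / (sum of tokens).

    firing_time maps transition -> firing time.
    tokens maps (from_transition, to_transition) -> initial token count of
    the place on that arc (hypothetical encoding of the marked graph).
    """
    succ = {t: [] for t in firing_time}
    for (u, v) in tokens:
        succ[u].append(v)
    best = Fraction(0)
    for circuit in simple_circuits(succ):
        total_time = sum(Fraction(firing_time[t]) for t in circuit)
        arcs = zip(circuit, circuit[1:] + circuit[:1])
        total_tokens = sum(tokens[arc] for arc in arcs)
        if total_tokens == 0:
            raise ValueError("token-free circuit: the marking terminates")
        best = max(best, total_time / total_tokens)
    return best


# Illustrative three-transition net (all numbers are made up).
firing = {"a": 2, "b": 3, "c": 1}
marking = {("a", "b"): 1, ("b", "a"): 0, ("b", "c"): 0, ("c", "a"): 1}
print(cycle_time(firing, marking))
```

In this example the circuit a, b has quotient 5/1 and the circuit a, b, c has quotient 6/2, so the first circuit determines the cycle time.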


Asynchronous  pipelined  systems  are  modeled  by  Petri  nets  with  special  structure.  These 
systems  contain  computational  elements  that  interface  by  means  of   input  and  output 
registers.     When  an  output  register  contains  results  it  does  not  pass  these  on  to  its 
neighbors  until  their  input  registers  are  vacant   (i.e.   no  longer  being  used  by  their 
computational  elements).     This  interface  is  modeled  by  the  Petri  net  fragment  of  Figure  6. 
This  fragment  has  been  used  for  making  places  of  a  Petri  net  safe  (i.e.  guaranteed  to 
contain no more than one token at any given time) [13]. A decision-free Petri net will be
called a safety net if the marked graph representation of the net is a symmetric graph.
This means that if an arc goes from node $t_1$ to $t_2$, then there is an arc going from $t_2$
back to $t_1$. Thus each pair of transitions in the net looks like the pair in Figure 6 or
else  has  no  place  in  common.     A  marking  of  a  safety  net  will  be  called  coherent  if,  given 


any  path  P^p^ 


p^t^.  and  any  pair  of  places  p^ , 


where  1  _<   i  _<  k,  one  place 


contains  a  positive  number  of  tokens  and  the  other  contains  no  tokens.     The  motivation 
behind  this  type  of  marking   is  that  a  coherently  marked  safety  net  is   intended  to  model 
pipelined  systems  in  which  input  registers  are  initially  loaded   (as  indicated  by  a  positive 
token  loading)   and  output  registers  are  initially  empty   (as  indicated  by  the  absence  of 
tokens).     Notice  that  the  Petri  net  model  of  the  systolic  array  (Figure  5)   is  a  coherently 
marked  safety  net. 

The  cycle  time  of  a  coherently  marked  safety  net  can  be  computed  in  a  straightforward 
manner.     This  is  a  result  of  the  following  theorem. 

Theorem: In a coherently marked safety net that has transitions $t_1, t_2, \ldots, t_n$
with execution times $T_1, T_2, \ldots, T_n$ and places $p_1, p_2, \ldots, p_m$ with initial
token loadings $P_1, P_2, \ldots, P_m$, the cycle time is given by

$$C = \max\left\{ \frac{T_i + T_k}{P_q + P_r} : 1 \le i \le n,\ p_q \text{ is an output place of } t_i \text{ and an input place of } t_k,\ p_r \text{ is an input place of } t_i \text{ and an output place of } t_k \right\}$$

The theorem follows from the fact that

$$\frac{\sum_{i=1}^{n} T_i}{\sum_{i=1}^{n} P_i} \le \max\left\{ \frac{T_i + T_{i+1}}{P_i + P_{i+1}} : 1 \le i < n \right\}$$

where the $T_i$'s and $P_i$'s are non-negative and n is even.

The  theorem  implies  that  the  exact  cycle  time  of  a  coherently  marked  safety  net  can  be 
calculated  by  enumerating  the  performance  quotients  for  all  loops  containing  exactly  two 
transitions  and  two  places.     This  procedure  can  be  used  to  quickly  find  the  cycle  time  of 
the systolic array represented in Figure 5. In an n by n hexagonal array the execution time of
transition t.compute.i.j has a fixed value for all i and j, but that of transition
t.send.(n,n).(n,n+1) will be maximal since it represents the communication delay with the
outside world, e.g. an output device or mass storage. Therefore the loop containing
transitions t.compute.n.n and t.send.(n,n).(n,n+1) and the places they share will determine
the cycle time of the array.
The  systolic  array  will  produce  one  result  per  cycle  time. 
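
The two-transition procedure can be illustrated with a short Python sketch. The function name `safety_net_cycle_time`, the tuple encoding of transition pairs, and all timing values are hypothetical choices made for this example; only the quotient (T_i + T_k)/(P_q + P_r) comes from the theorem above.

```python
from fractions import Fraction


def safety_net_cycle_time(exec_time, pairs):
    """Cycle time of a coherently marked safety net.

    exec_time maps transition name -> execution time.
    pairs is a list of (t_i, t_k, tokens_q, tokens_r), where tokens_q is the
    loading of the place from t_i to t_k and tokens_r is the loading of the
    place from t_k back to t_i.
    """
    best = Fraction(0)
    for t_i, t_k, tokens_q, tokens_r in pairs:
        loop_tokens = tokens_q + tokens_r
        if loop_tokens == 0:
            raise ValueError("two-transition loop without tokens")
        loop_time = Fraction(exec_time[t_i]) + Fraction(exec_time[t_k])
        best = max(best, loop_time / loop_tokens)
    return best


# Made-up timings: computing takes 1 unit, an on-array send takes 2 units,
# and the send to the outside world takes 10 units.
times = {
    "t.compute.n.n": 1,
    "t.send.(n,n-1).(n,n)": 2,
    "t.send.(n,n).(n,n+1)": 10,
}
# Coherent marking: one place of each pair holds a token, the other is empty.
loops = [
    ("t.send.(n,n-1).(n,n)", "t.compute.n.n", 1, 0),
    ("t.compute.n.n", "t.send.(n,n).(n,n+1)", 1, 0),
]
print(safety_net_cycle_time(times, loops))
```

With these numbers the loop through the outside-world send dominates, matching the argument that the slowest communication step sets the cycle time of the array.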

4.0  Conclusion 


We  have  presented  a  formal  method  for  determining  the  performance  (as  reflected  by  cycle 
time)   of  architectures  that  employ  pipelining.     This  method  is  based  on  the  use  of  timed 
Petri nets for modeling the system under evaluation. The advantages of this method are that
it  may  be  applied  more  efficiently  than  existing  methods  and  it  allows  system  designers  to 
evaluate  an  architecture  in  the  pre-implementation  stages  of  the  system's  life  cycle. 
Figure 2. Timed Petri Net Example.

Figure 3. Petri Net Model of Computational Element.

Figure 4. (a) Decision-Free Petri Net. (b) Corresponding Marked Graph.

Figure 5. Petri Net Representation of Cell (i,j) of a Hexagonal Systolic Array.

Figure 6. Transition Pairs in a Safety Net.
Computer networking in the context of very large scale integration (VLSI)

Abstract

As  requirements  for  computing  increase  in  magnitude,  multiple  processor  networks  have 
become  increasingly  important.  The  demand  is  a  direct  outgrowth  of  the  success  of  VLSI  in 
providing  high  levels  of  computation  at  a  modest  price.  Future  levels  of  performance  may 
well  be  most  effectively  met  with  computer  networks  where  the  elemental  processes  are  single 
VLSI  circuits. 

This  paper  surveys  several  common  computer  networking  approaches  and  presents  a  novel 
concept,    the  Gatlinburg  Rings.      It   is  shown  to  be  attractive  for  large  networks. 

Introduction

With  the  rapid  growth  in  integrated  circuit  performance  over  the  last  decade,  it  is 
attractive  to  consider  the  development  of  complete  systems  on  a  single  chip.  Success  in 
this area spawns greater expectations which are realized only by the use of systems comprising
tens to hundreds of functional elements (each of which is implemented with a small
number of integrated circuits). The functional elements of such systems may be interconnected
using dedicated point-to-point data links or through a network structure. Dedicated
links are used when the interconnection is tailored to the specific requirements of the
problem (as is often true of signal processing systems). Network structures use regular
physical connection patterns which are general purpose in the sense that they are not tailored
to a specific application.

It  is  well  known  from  experience  with  processor  development  that  tailored  special  purpose 
designs  often  attain  efficiency  by  sacrificing  flexibility. 

The goal of computer networking is to achieve high levels of computation (i.e.,
throughput) for general purpose applications through the use of regular physical interconnection
structures. In this paper several networking approaches are examined to gain an
understanding of the performance and cost of the various approaches.


The notion of a computer network is shown in Figure 1. Several computers are connected
together via an interconnection network. The interconnection network provides data (and
control) communications between the various processors and provides input and output
connections for user interface.

The operational advantages of computer networking have become widely recognized. These
structures can be extremely flexible since the task assignment to individual processing
elements is via software rather than through a special purpose architecture and a fixed
hardware configuration. With appropriate design of the network communications protocol,
the number of computers in the network can be increased or decreased in response to changes
in the usage requirements. Similarly, a sufficiently versatile network communications
protocol allows the inclusion of a variety of computers with a wide variety of speeds,
computational capabilities, physical configurations, etc., since the only constraint on the
computers is that the interface to the communications network must obey a predetermined
protocol.
Computer networks achieve increased reliability, survivability, and modularity by virtue
of the loose coupling between computers. This is achieved at the cost of increased design
and test effort.1 The networks are capable of resisting obsolescence, since the communications
structure may remain intact while some or all of the computers are upgraded with newer
technology to achieve higher levels of performance.


Example  networks 


Since  a  wide  variety  of  network  architectures  have  been  developed  over  the  last  decade,  a 
brief  description  of  several  of  the  more  common  networks  will  provide  an  introduction  to  the 
options  available  to  the  system  designer. 

Distributed  networks  of  two  predominant  types  have  been  developed.  In  the  first  type, 
data  transfers  are  made  between  processors  and  shared  memories.  In  the  second  type,  data 
transfers  are  performed  between  processors.  In  the  first  case,  each  processor  may  have  an 
attached  working  memory  or  may  use  the  shared  memories  (accessing  them  via  the  network).  In 
the  second  case,  all  processors  have  attached  memories.  Intuitively,  the  attached  memory 
approach  seems  most  efficient,  although  definitive  results  have  not  been  reported  in  the 
literature. 
Crossbar network: The crossbar network, originally developed for telephone exchange
networks, is shown in Figure 2. Processing elements (processors, memories, I/O devices,
etc.) are interconnected through the ends of the orthogonally intersecting buses; switches
at each of the bus intersections allow any processor to be connected to any resource in the
network. Control of the switches along any row of the crossbar is performed by the processor
associated with that row. Before setting a switch, the processor verifies that the vertical
bus at the desired switch is not in use. The processor then sets the switch and can
communicate with the selected memories as though it had a dedicated point-to-point link.
Since the complexity of this network grows as $N^2$, the crossbar has been used primarily
for applications where maximum interconnection is required for a relatively small number
of nodes.

Figure 2. Crossbar network


Star  network:  The  star  network  consists  of  a  central  switching  hub  with  data  channels  from 
the  hub  to  the  processing  elements,  as  shown  in  Figure  3.  The  data  channels  contain  a  pair 
of  high-speed  unidirectional  data  links.  Data  paths  can  be  established  between  any  two  or 
more  of  the  processing  elements  via  the  switching  and  routing  structure  located  in  the  hub. 
Any  node  may  be  assigned  as  either  a  source  or  destination.  The  hub  is  not  a  crossbar 
structure  in  this  context,  but  rather  a  data  multiplexer  at  the  hub  input  and  a  programmable 
multiple-tap  demultiplexer  at  the  hub  output,  to  allow  multiple  destinations  to  receive  the 
data  stream.  With  this  implementation,  only  a  single  data  source  may  be  active  at  any  time. 
Star  networks  are  thus  unsuitable  if  multiple  sources  must  transmit  data  simultaneously. 
Several  mechanisms  for  controlling  such  a  structure  have  been  described.  Generally  the  hub 
contains  a  special  processing  element  that  serves  as  the  controller.  Requests  for  service 
at  the  various  data  ports  are  transmitted  to  the  controller  which  programs  the  hub. 

Cube network: The cube network is currently undergoing intensive theoretical analysis.2 In
this  approach,  a  group  of  nodes  are  interconnected  through  ranks  of  exchange  boxes.  As 
shown  in  Figure  4,  each  exchange  box  accepts  a  pair  of  inputs  from  exchange  boxes  in  the 
previous  rank,  and  connects  the  two  inputs  to  the  two  outputs  in  either  a  straight  or 
exchange  pattern.  Communications  paths  between  nodes  on  the  network  input  and  output  can  be 
established  via  external  control  of  the  switch  settings3  (as  for  the  star  network)  or  the 
transmitted  data  can  be  converted  into  packets.  Each  packet  is  prefixed  by  a  destination 
address  which  is  decoded  in  real  time  by  each  successive  switching  node  to  determine  the 
proper  setting  to  transmit  the  packet  to  the  next  rank  of  switches.4  Transmission  of  data 
through  a  cube  network  in  packets  requires  that  each  node  contain  a  small  amount  of  memory 
to  provide  temporary  storage  for  packets  in  transit,  and  also  sufficient  logic  to  decode  the 
packet destination address. Packet transmission through a cube network is susceptible to
network blockage which can occur if a packet is transmitted to an exchange box whose paths
are already engaged. The blockage problem can be reduced by controlling the individual
exchange boxes via an external controller, but this creates a potential single point failure
that can disable the entire network.

Figure 4. Cube network
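
The packet-routing idea, in which each rank of exchange boxes decodes part of the destination address, can be sketched as follows. The wiring between ranks is not specified in the text, so this example assumes a perfect-shuffle connection before each rank (an omega-style arrangement) and assumes that rank r inspects bit r of the destination, most significant bit first. The function name `route` is hypothetical.

```python
def route(source, dest, n_bits):
    """Route one packet through n_bits ranks of 2x2 exchange boxes.

    Returns the final port reached and the setting ("straight" or
    "exchange") chosen at each rank. Assumes a perfect shuffle of the
    port numbers ahead of every rank; this wiring is an assumption.
    """
    size = 1 << n_bits
    position = source
    settings = []
    for rank in range(n_bits):
        # Perfect shuffle: rotate the port number left by one bit.
        position = ((position << 1) | (position >> (n_bits - 1))) & (size - 1)
        # The box output (upper = 0, lower = 1) is selected by one address bit.
        wanted = (dest >> (n_bits - 1 - rank)) & 1
        settings.append("straight" if (position & 1) == wanted else "exchange")
        position = (position & ~1) | wanted
    return position, settings


final_port, box_settings = route(source=5, dest=2, n_bits=3)
print(final_port, box_settings)
```

Because every rank overwrites one bit of the port number with one bit of the destination, the packet arrives at the addressed port after n_bits ranks regardless of where it entered. Blocking is not modeled here: two packets that need different settings of the same box at the same time would conflict.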

Ring  network:  A  ring  network  is  a  sequential  connection  between  processors,  with  the 
outputs  of  ring  node  I  connected  to  the  inputs  of  ring  node  I  +  1  and  so  on  until  the 
outputs  from  ring  node  N  are  connected  to  the  inputs  of  the  first  ring  node,  as  shown  in 
Figure  5.  During  operation,  all  nodes  transfer  data  simultaneously  to  their  successors 
around  the  ring.  Thus  data  from  the  Ith  node  is  clocked  into  the  I  +  1st  node,  then  into 
the  I  +  2nd  node,  etc.  The  inputs  and  outputs  of  each  ring  node  in  the  ring  are  accessible 
to  an  external  device,  which  can  read  data  from  the  ring  node  as  the  data  words  pass  by  on 
the   ring;    the  external  device  places  data  on  the  ring  via  the   ring  register. 

In their basic form ring structures are vulnerable to failures in the communication links
between the nodes. One approach to greatly improve the fault-tolerance of rings, without
incurring an undue complexity penalty, is the braided ring network shown in Figure 6. In
the braided ring each node output is routed to the inputs of the immediate successor and the
second successor. Each node can eliminate its predecessor by using the output from its
second predecessor. Thus if any link or ring node is disabled, the network continues to
operate, although with one less processor.

Figure 5. Ring network

Figure 6. Braided ring network

Bus network: Bus networks use a common shared data channel for communication, as shown in
Figure 7. With a bus, any processor may transmit a message to any other processor by use of
the common data channel, although only one processor can transmit at a time. Bus structures
became popular in the late 1960s as an attractive method to couple a number of relatively
low-speed computer peripherals to a single Central Processing Unit (CPU).5 In this mode, a
single bus controller is used to grant bus access to the various users, often according to a
predetermined priority.

In using a bus to interconnect a number of relatively high-speed processors, the contention
allocation approach is currently popular.6 With this approach, all users have free access to
the bus. The user transmits onto the bus and monitors the bus to verify that the transmission
was successful (i.e., that another processor was not transmitting at the same time). Both
users must retransmit if the transmission was unsuccessful.

Figure 7. Bus network
Fully connected network: A fully connected network is one which establishes direct
point-to-point  data  link  connections  between  all  nodes  in  the  network.7  Although  too 
complex  for  most  large  networks,  fully  connected  networks  are  useful  for  small  networks  with 
high  communications  requirements. 

An  attractive  approach  to  combat  the  high  complexity  of  fully  connected  networks  is  a 
sparse  network  in  which  an  irregular  topology  is  created  with  data  links  as  required  for 
each  specific  application.8  Such  networks  generally  achieve  complexities  that  are  roughly 
proportional to the number of processors in the system. Sparse networks have the disadvantage
for general purpose processing of communication nonuniformity (i.e., the ease of
communicating between two processors depends on whether or not there is a direct link
between them). If there is not a direct link between two processors, a number of links are
"concatenated"  to  create  a  composite  link  between  them.  Such  composite  linking  introduces 
significant,  but  tractable,  control  complexity.  The  sparse  network  approach  is  directly 
applicable  to  the  development  of  signal  processing  systems,  where  the  designer  can  tailor 
the  network  to  the  flow  form  that   is  optimum  for  the  specific  problem. 

Gatlinburg  Rings  network:  The  Gatlinburg  Rings  network  architecture  is  a  hierarchical 
multiple  ring  approach  that  derives  its  name  from  having  been  described  first  at  a  workshop 
in Gatlinburg, Tennessee, in the spring of 1980,9 and amplified in subsequent publications.10

The  basic  structure  of  the  Gatlinburg  Rings  is  shown  in  Figure  8.  In  its  basic  form,  it 
consists  of  a  single  high-level  ring  connecting  K  lower-level  rings.  Additional  rings  may 
be  connected  to  the  lower-level  rings  to  provide  a  third  level  of  rings,  etc.,  and  this 
process  may  be  continued  indefinitely  to  provide  any  desired  number  of  levels  in  the 
hierarchy.  Ring  nodes  for  the  Gatlinburg  Rings  are  similar  to  ring  nodes  in  conventional 
ring  networks.  They  can  be  implemented  quite  effectively  with  commercially  available  TTL  or 
ECL with about 50 to 100 SSI/MSI chips, including clock buffers, line drivers, etc., per
port. As an alternative, the ring port could be implemented on a single VLSI chip because
the design involves fewer than 5000 gates. The VLSI chip requires more leads than can be
accommodated with current commercial packages, but leadless chip carriers should be adequate.

The  achievable  ring  clock  rates  range  from  10  MHz  for  a  commercial  TTL  implementation,  to 
50 MHz for commercial ECL implementations, to over 100 MHz for custom VLSI designs implemented
with advanced ECL technology. Only a portion of the ring port logic operates at the
ring  clock  rate.  For  example,  the  processor  interface  logic  which  accounts  for  much  of  the 
logic  complexity  may  be  implemented  with  TTL  circuits  to  achieve  high  logic  density  and  low 
parts  count.  FIFO  buffers  with  good  density  allow  speeds  of  20  MHz,  which  serves  as  a  good 
compromise  between  the  high   ring  speed  and  the  lower  processor  speeds. 
Figure 8. Gatlinburg Rings network

In  this  configuration,  all  computing  resources  interface  to  the  rings  via  a  processor. 
Thus  data  input,  output,  and  bulk  memory  access  are  performed  through  a  processor.  This 
interface  approach  is  selected  because:  1)  the  processor  provides  an  intelligent  interface 
which  can  check  for  data  or  request  validity,  implement  special  protocols,  enforce  security, 
etc.,  and  2)  the  processor  provides  a  low  cost  programmable  interface  which  may  be  modified 
via  software  reconfiguration  to  provide  the  flexibility  necessary  to  satisfy  a  wide  variety 
of  data  processing  requirements. 

Although  a  detailed  analysis  of  the  delay  of  various  Gatlinburg  Rings  configurations  is 
beyond  the  scope  of  this  paper,  maximum  delay  characteristics  are  of  interest.  For  a 
network  with  N  elements  in  two  levels  (K  nodes  on  the  high-level  ring  and  N/K  nodes  at  each 
lower-level  ring),  the  round  trip  delay  from  one  processor  to  any  other  processor  and  back 
to the original processor can be easily calculated. It is $K + 2N/K$ clock periods. Differentiation
with respect to $K$ and setting the derivative to zero yields $K = \sqrt{2N}$, for which
the round trip delay is $2K$ ($K/2$ each for the source and destination low-level rings and $K$
for  the  high-level  ring).  For  practical  systems  it  is  necessary  to  use  integer  ring  sizes 
and the round trip delay is slightly greater. An interesting case is that in which the
number of elements on each ring is a power of 2 (i.e., 8, 16, 32, etc.) as this simplifies
the  addressing.  For  N  =  128,  these  design  rules  produce  a  16-node  high-level  ring  with  low- 
level  rings  of  8  nodes  each.  The  total  round  trip  delay  from  any  low-level  ring  node  to 
another  node  on  a  different  low-level  ring  is  32  and  represents  an  improvement  by  a  factor 
of  4  in  the  delay  of  a  conventional  ring.  Selection  of  N  =  128  and  K  =  16  provides  simple 
addressing,  as  the  low  order  3  bits  of  each  address  can  select  the  individual  processor  on  a 
low-level   ring  and   the  top  4   bits    (of   the  7-bit  address)    select  the  low-level  ring. 
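
The sizing rule and the address split for N = 128 can be checked with a short Python sketch. The function names are hypothetical, and the search over power-of-2 ring sizes is simply one way to apply the design rule stated above.

```python
import math


def round_trip_delay(n, k):
    """Round trip delay in clock periods for N elements and K low-level rings."""
    return k + 2 * n / k


def best_power_of_two_k(n):
    """Power-of-2 size K of the high-level ring that minimizes the delay."""
    candidates = [2 ** e for e in range(1, n.bit_length()) if n % (2 ** e) == 0]
    return min(candidates, key=lambda k: round_trip_delay(n, k))


def split_address(address, low_bits=3):
    """Split an address into (low-level ring number, processor on that ring)."""
    return address >> low_bits, address & ((1 << low_bits) - 1)


n = 128
k = best_power_of_two_k(n)
print(k, math.sqrt(2 * n))        # chosen K and the unconstrained optimum
print(round_trip_delay(n, k))     # delay of the two-level network
print(n / round_trip_delay(n, k)) # improvement over a single ring of N nodes
print(split_address(0b1010101))   # 7-bit address: top 4 bits, low 3 bits
```

For N = 128 the unconstrained optimum sqrt(2N) = 16 is already a power of 2, so the integer design incurs no extra delay in this case.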

Network  characterization 

Networks  can  be  compared  by  a  variety  of  criteria  depending  on  the  specific  application. 
One  approach  is  to  estimate  the  network  performance,  cost,  and  quality  (defined  here  as  the 
performance  divided  by  the  cost).  Although  these  parameters  are  easily  estimated  for  the 
example  networks  described  in  the  previous  section,  it  is  necessary  to  emphasize  the 
immaturity  of  these  criteria.  It  is  currently  difficult  to  examine  network  performance  in 
the  absence  of  a  specific  application.  Accordingly,  these  criteria  may  be  subject  to 
significant  change   in  order  to  develop  an  effective  network  comparison  scheme. 

Network  performance 

The  network  performance  depends  on  the  media  bandwidth  (i.e.,  the  product  of  the  number 
of  active  data  links  times  the  link  bandwidth)  and  the  path  length  (i.e.,  the  delay  incurred 
in  the  transfer  of  a  message  from  one  processor  to  another  and  back  to  the  originating 
processor). These parameters are summarized in Table 1 for the networks described in the
previous section. The first column presents the maximum number of messages, M, active in
each type of system. The link bandwidth, B, is assumed to be 1 for links connecting a
single source to a single destination, and 1/N for links with N destinations. This penalty
for multi-drop links is due to the increased capacitive loading and length, which generally
increase in rough proportion to the number of taps. The round trip message delay, D, is the
number of links traversed by a message in transit from one processor to another and back to
the original processor. Unlike one-way delay, the round trip delay measure is constant for
all source and destination processor pairings.

Table 1. Network Performance Comparison

The network performance, NP, is estimated by the relation:

NP = MB/D (1)

Larger values of NP indicate more favorable network performance. This relation indicates
the desirability of networks capable of carrying multiple messages, with high link bandwidth,
and with minimal links in a round trip message path.

The  performance   ranges   from  N/2   for   fully  connected   networks   to  for  the  Gatlinburg 

Rings,  1  for  the  ring  network,  1/2  for  the  cube  network,  1/4  for  the  crossbar  and  star 
networks,  and  1/2N  for  bus  networks.  Since  differences  of  a  factor  of  2  to  4  are  probably 
negligible,  this  indicates  that  fully  connected  networks  offer  the  best  performance,  bus 
networks  the  poorest,   and  the  other  networks  are  intermediate. 

Network  cost  characterization 

The  cost  or  complexity  of  a  network  is  basically  the  sum  of  the  link  cost  and  the  switch 
cost.  In  the  absence  of  contradictory  data,  equal  weight  is  assigned  to  those  two 
components. The number of data links, L, is shown for each of the six networks in Table 2.
The cost or complexity of the switch(es) is the product of the number of switches, I, times
the complexity of each switch. The complexity of individual switches is the product of the
number of poles, P, and the number of positions, T; thus an N:1 switch is equivalent in
complexity to a 1:N switch, N times as complex as an SPST switch, etc. The total network
cost, NC, is given by:

NC = L + IPT (2)

Table 2. Network Cost Comparison
Increasing values of NC indicate a more costly or more complex network. This measure
increases as the number of links or the number or complexity of the switches increases.
This measure ranges from $N + 1$ for the bus network to $(3N^2 - N)/2$ for the fully connected
network.

Network  performance/cost  ratio 

The network performance, NP, and cost, NC, from equations (1) and (2) may be combined to
give  a  network  quality  function: 

NQ  =  NP/NC  (3) 

The quality ratios are computed in Table 3. Increasing NQ values are more desirable as they
indicate increased network performance, decreased cost, or some combination of these positive
attributes. In all cases, the network quality decreases with increasing N, indicating
that network performance is best for small networks and decreases as the number of nodes
increases. The quality values range from a proportionality to $1/N$ for the fully connected,
ring, and star networks, to a proportionality to $1/(N (\log_2 N)^2)$ for the cube, to a
proportionality to $1/N^2$ for the bus and crossbar networks. These figures are consistent
with the intuitive notion that networks become less efficient as the number of nodes
increases and generally confirm prior results, although definitive comparisons have not been
reported in the literature. Figure 9 is a graph of the network quality as a function of
size for networks of 4 to 128 processors. The fully connected network offers the best
quality for small systems while the Gatlinburg Rings is best for large systems. Again, it
is appropriate to emphasize that this network quality does not reflect application-specific
characteristics (i.e., data flow rates and distributions) which must be examined in detail
before selecting a network.

Table 3. Network Quality Comparison

Figure 9. Network quality vs. size comparison

Conclusions

This  paper  has  examined  computer  networking  in  the  context  of  VLSI.  A  network  quality 
measure  has  been  defined  to  aid  in  the  selection  and  comparison  of  networks.  It  agrees  well 
with  previous  intuitive  results  and  should  be  useful  for  first  order  comparison  of  networks. 
With  this  quality  measure  the  Gatlinburg  Rings  network  has  been  shown  to  be  attractive  for 
large  systems. 
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Abstract 

The  U.   S.   Army  operates  a  field  laboratory  where  realistic  combat  simulations  between  jet 
aircraft,   helicopters,   tanks,   and  infantry  can  be  closely  observed.     Lasers  are  used  to 
simulate  the  weapons  carried  by  as  many  as  200  players.     Laser  firings,  hits,   and  player 
location  are  monitored  by  a  telemetry  and  range  measurement  system  controlled  by  a  computer 
network.     Player  combat  engagements  are  evaluated  in  real-time  by  the  computer  network  and 
the  results  returned  to  the  player. 

The computer network primarily consists of 12 PDP-11/45s and a DEC-1060. The PDP-11/45s
operate under RSX-11M and RSX-11S. Each PDP-11/45 processor communicates with the other
processors  through  a  32K  shared  memory.     Application  software  includes  telemetry  polling 
and  control,   player  position  calculation,   real-time  casualty  assessment,   and  various 
monitors  and  displays.     The  major  focus  of  this  paper  is  the  development  of  a  successful 
high  speed,   general  purpose  interprocessor/intertask  data  communication  system,  operating 
within  the  shared  memories,  which  facilitates  concurrent  processing  of  data  with  minimal 
overhead. 

U.   S.   Army  Combat  Developments  Experimentation  Command   (USACDEC)  mission 

USACDEC is charged with the responsibility of performing the operational test and evaluation
of new weapon systems and tactics. In a single experiment, testing may include such
diverse  systems  as  U.   S.  Air  Force  and  Marine  high  performance  attack  aircraft,   U.   S.  Army 
attack  helicopters,   and  armored  vehicles  acting  in  concert  against  simulated  Soviet  armored 
vehicles, anti-aircraft defenses, and attack helicopters. Battlefield realism is approximated
as closely as possible, including the effects of electronic warfare, noise, and
smoke.     The  experimentation  results  are  utilized  in  funding  decisions  and  to  provide 
perspective  regarding  the  relative  merits  of  various  weapons  systems  and  approaches  to 
their  employment.     Careful  attention  is  paid  to  assuring  the  statistical  validity  of 
experiment  designs  and  their  execution.     When  a  single  one  hour  trial  of  a  multi-trial 
experiment  may  cost  more  than  half  a  million  dollars,   a  great  deal  of  effort  is  devoted 
to  guaranteeing  the  validity  and  robustness  of  experimental  data. 

Traditionally, tactical combat simulation has been accomplished through the use of on-site
human referees and gun cameras. This method left much to be desired, since the
necessary presence and actions of the referees caused considerable player distraction and
allowed only a semblance of combat realism. Additionally, referee casualty assessments and
data collection activities were relatively subjective and often error prone. Finally,
since player feedback did not occur in "real-time", experiment results have often been less
than definitive.

Several  years  ago,   an  automated  instrumentation  system  for  performing  casualty  assessment 
at  USACDEC  was  conceived  and  built.     Using  microprocessor-controlled  lasers  to  simulate  the 
active  elements  of  weapons  systems,   a  range  determination  and  telemetry  network,  and  a 
central  computer  for  data  collection,   the  system  allows  accurate  casualty  assessments  and 
player  feedback  to  be  performed  in  a  near  real-time  manner  while  providing  a  trustworthy 
means  of  data  collection. 

An  overview  of  the  instrumentation  system 

The Range Measuring System (RMS) and the Multi-Computer System (MCS), coupled with the
Player Instrumentation System, allow the conduct of free play, force-on-force simulated
combat  trials.     The  system  allows  tailoring  of  casualty  assessment  and  data  collection 
methodologies  to  the  particular  weapons  systems  and  trial  conditions  under  study. 

The  RMS  operates  at  a  frequency  of  918  MHz  and  requires  line  of  sight  between  sender 
and receiver. The principal components are a central control station, fixed site interrogator
units, and transponders mounted on each player. Player position is calculated
by the MCS through a multilateration process that uses the ranges between a number of
interrogator units and the player. The RMS also provides the mechanism whereby player actions and status
are  transmitted  to  the  MCS  for  processing  during  real-time.     The  telemetry  system  is 
capable  of  over  1200  range  and  action  messages  per  second. 
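
As a rough illustration of the multilateration step, the sketch below recovers a position from ranges to fixed interrogator sites by linearizing the range equations and solving them in a least-squares sense. It is only a sketch under stated assumptions, not the USACDEC algorithm (which the paper identifies as a Kalman filter): the function name `multilaterate`, the flat two-dimensional geometry, and the site coordinates are all hypothetical.

```python
import math

def multilaterate(sites, ranges):
    """Estimate (x, y) from ranges to at least three non-collinear sites."""
    (x0, y0), r0 = sites[0], ranges[0]
    rows = []
    # Subtracting the first range equation from each of the others cancels
    # the quadratic terms and leaves linear equations a*x + b*y = c.
    for (xi, yi), ri in zip(sites[1:], ranges[1:]):
        a = 2.0 * (xi - x0)
        b = 2.0 * (yi - y0)
        c = r0**2 - ri**2 + xi**2 + yi**2 - x0**2 - y0**2
        rows.append((a, b, c))
    # Solve the 2x2 normal equations of the least-squares problem.
    saa = sum(a * a for a, b, c in rows)
    sab = sum(a * b for a, b, c in rows)
    sbb = sum(b * b for a, b, c in rows)
    sac = sum(a * c for a, b, c in rows)
    sbc = sum(b * c for a, b, c in rows)
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("sites are collinear or too few")
    return (sac * sbb - sab * sbc) / det, (saa * sbc - sab * sac) / det

# Hypothetical layout: four interrogator sites and one player (metres).
sites = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0), (1000.0, 1000.0)]
truth = (300.0, 400.0)
ranges = [math.hypot(truth[0] - x, truth[1] - y) for x, y in sites]
estimate = multilaterate(sites, ranges)
```

Using more sites than the minimum lets the least-squares solution average out range errors, which is one reason the choice of interrogators with good relative geometry matters.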
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The  function  of  the  lasers  is  to  identify  the  target  when  a  weapon  firing  takes  place. 
The  system  consists  of  eyesafe  lasers  and  sensors  mounted  on  each  player.     The  laser  is 
boresighted  with  the  sighting  system  of  each  player's  weapon.     When  a  weapon  is  fired,  the 
laser  is  activated  and  modulated  with  a  code  unique  to  the  weapon  and  player. 

When one player fires at another, the following sequence of events occurs. The firer,
upon squeezing his trigger, causes a pyrotechnic device to detonate on his weapon,
representing its firing; a "fire" event is sent to the MCS via the player's transponder and the
telemetry system, and the laser is activated. The target, upon being illuminated by the
laser, decodes the firer's identification and sends a message to the MCS that it has been
"paired". The MCS uses the firer's and target's identities, together with their respective
positions (which have continually been tracked) and the engagement conditions (e.g., the range),
to stochastically calculate a result for the engagement. If the result is a "kill", the
target's laser is disabled by its microprocessor and a buzzer is sounded to alert the human
player. If the result is a "survive", a light indicates to the player that he is under
fire.
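
The stochastic step in this sequence can be pictured with a small sketch. Everything numeric here is an assumption made for illustration: the paper gives no kill probabilities, so `kill_probability` uses a made-up linear falloff with range, and the function names are hypothetical.

```python
import random

def kill_probability(range_m, max_range_m=3000.0):
    """Hypothetical model: 0.9 at zero range, falling linearly to 0 at max range."""
    if range_m >= max_range_m:
        return 0.0
    return 0.9 * (1.0 - range_m / max_range_m)

def assess_engagement(firer_pos, target_pos, rounds_left, rng):
    """Return 'illegal', 'kill', or 'survive' for one paired fire event."""
    if rounds_left <= 0:
        return "illegal"              # legality check: firer has no ammunition
    dx = target_pos[0] - firer_pos[0]
    dy = target_pos[1] - firer_pos[1]
    range_m = (dx * dx + dy * dy) ** 0.5
    # Stochastic outcome: compare a uniform draw with the kill probability.
    if rng.random() < kill_probability(range_m):
        return "kill"
    return "survive"

# Example call with tracked positions in metres and a seeded generator.
result = assess_engagement((0.0, 0.0), (1200.0, 500.0), rounds_left=3,
                           rng=random.Random(1))
```

Passing the random generator in as an argument keeps an assessment repeatable, which helps when a trial has to be replayed from the logged data.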

Computer  applications  required  to  support  an  experiment 

The MCS generates commands for the range measurement and telemetry system, does the
position  determination  calculations,   performs  the  real-time  casualty  assessment  simulations, 
and  provides  extensive  displays  for  experiment  monitoring  and  control. 

Range  Measurement  System  control 

Each  player  type  has  different  polling  frequency  requirements  depending  on  its  speed  and 
activity. Fast-moving players (e.g., aircraft) require a range polling rate 10 times
greater than slow-moving players (e.g., tanks) to achieve the same accuracy of position
data.     The  MCS  generates  range  requests  directly.     The  telemetry  system,   upon  satisfying  a 
range  request,  will  also  return  player  action  messages  gratuitously.     Polling  rates 
typically  vary  from  4  range  requests  per  second  for  "normal"  players  to  40  or  more  for 
active,   fast-moving  players.     The  instrumentation  system  must  maintain  a  player  action  and 
position time resolution of 0.01 seconds at all times.
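
A back-of-the-envelope check of these rates against the telemetry capacity quoted earlier (over 1200 range and action messages per second) can be written as a short calculation. The sketch assumes, for simplicity, that each range request costs one message and that action messages add no extra load; neither assumption comes from the paper, and the function names are hypothetical.

```python
CAPACITY_PER_SEC = 1200   # telemetry capacity quoted in the text
NORMAL_RATE = 4           # range requests per second, "normal" player
FAST_RATE = 40            # range requests per second, fast-moving player

def range_request_load(n_normal, n_fast):
    """Total range requests per second for a given player mix."""
    return n_normal * NORMAL_RATE + n_fast * FAST_RATE

def max_fast_players(total_players):
    """Largest number of fast players that still fits within capacity."""
    spare = CAPACITY_PER_SEC - total_players * NORMAL_RATE
    if spare < 0:
        return 0
    return min(total_players, spare // (FAST_RATE - NORMAL_RATE))

load = range_request_load(190, 10)   # 190 normal players and 10 fast ones
limit = max_fast_players(200)        # with the full complement of 200 players
```

Under these assumptions, 200 players polled at the normal rate already use two thirds of the capacity, so only about a dozen of them can be promoted to the fast rate at any one time.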

Due  to  the  nature  of  combat,   players  of  all  types,   including  aircraft,  will  attempt  to 
use terrain features for concealment. In doing so, they are quite likely to conceal themselves
from most interrogator units as well as their opponents. In order to generate high
quality  position  information,   the  changing  player  positions  must  be  constantly  rematched 
by  the  MCS  with  a  new  set  of  interrogators  with  appropriate  relative  geometries.     In  some 
instances,   one  or  more  interrogators  are  mounted  on  an  aircraft  stationed  above  the  playing 
area.     Typically,   position  accuracy  requirements  are  on  the  order  of  10  meters. 

Position  determination 

The position/location algorithm is a Kalman filter. The Kalman filter uses the available
information to calculate each player's position and its speed vector. The Kalman filter is a
predictive algorithm and, as such, can detect and attempt to correct bad or missing
data. In order to assure that accurate position data is immediately available for all
players during real-time, four Kalman filters are operated concurrently to process range
data.
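
The predictive behaviour described above can be illustrated with a textbook constant-velocity Kalman filter in one dimension. This is a deliberately reduced sketch, not the USACDEC filter: it assumes a direct position measurement instead of raw ranges, and the noise values `q` and `r`, the class name, and the initial covariance are arbitrary illustrative choices.

```python
class ConstantVelocityKalman:
    """1-D Kalman filter over the state [position, velocity]."""

    def __init__(self, q=0.01, r=1.0):
        self.p, self.v = 0.0, 0.0                  # state estimate
        self.P = [[1000.0, 0.0], [0.0, 1000.0]]    # large initial uncertainty
        self.q, self.r = q, r                      # process / measurement noise

    def predict(self, dt):
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = diag(0, q)
        self.P = [[a + dt * (b + c) + dt * dt * d, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                 # innovation variance
        k0, k1 = a / s, c / s          # Kalman gain for position and velocity
        y = z - self.p                 # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]

    def step(self, dt, z=None):
        """Advance one poll interval; z=None models a missing measurement."""
        self.predict(dt)
        if z is not None:
            self.update(z)
        return self.p, self.v
```

When a measurement is missing, `step` simply coasts on the prediction, which is the sense in which a predictive filter can bridge bad or missing data.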

Real-time  displays 

The  Visual  Information  Display  System   (VIDS)    is  a  1.5m  by  2m  color  graphics  display  with 
additional  high  resolution  color  graphic  terminals.     Each  player's  position  and  actions  may 
be  displayed  in  real-time  against  a  digitized  map  of  the  playing  area.     Each  player's  type 
and  status  is  shown  using  different  symbols,   and  player  actions  are  shown  dynamically  in  a 
near  real-time  mode. 

In order to properly control each experiment and monitor the instrumentation system's
performance, several additional displays are used. These displays reflect reported player
coordinates, raw action data, instrumentation performance statistics, and the like. A
continuous log is made of all raw data received by the system and of all major data processing
results, such as the position calculations and casualty assessment results.

Real-time  casualty  assessment  simulations 

The real-time casualty assessment simulations must rapidly validate an engagement as
to its legality (e.g., the firer has ammunition) prior to calculating whether a player has
been damaged or killed.

A  wide  variety  of  differing  conditions  can  affect  the  outcome  of  a  simulated  engagement. 
The  effects  of  such  variants  as  targeting  modes  and  current  firing  doctrine  must  be 
accounted  for  depending  on  player  type  and  status.     In  order  for  an  assessment  to  begin, 
a valid engagement must be detected and reported. Such parameters as whether the target
or the firer is moving or stationary are important, as well as the range at the time of
firing. Some weapons, notably the anti-armor missiles, do not require illumination of the
target  at  the  time  of  firing.     Multiple  ammunition  types,   target  aspect  relative  to  the 
firer  at  time  of  firing,   and  time  of  round  impact  must  all  be  considered  in  determining 
which  of  several  types  of  damage,   if  any,   to  assign  to  the  target.     In  the  case  of  the 
surface-to-air  missile    (SAM)    simulations,   the  target  aircraft  location  is  continuously  fed 
to  the  proper  MCS  SAM  casualty  assessment  model  in  real-time.     The  computer  model  simulates 
the  aerodynamic  and  guidance  control  properties  of  the  missile  and  literally  "flies  out" 
the  missile  to  the  aircraft. 


The  USACDEC  network  hardware  configuration 


The nucleus of the USACDEC instrumentation system is the network of 15 computers providing
real-time experiment control, data collection, and simulation. Figure 1 shows the
logical  arrangement  of  the  MCS  and  the  additional  support  computers  which  it  controls. 
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Figure  1.     Configuration  of  the  CDEC  multi-computer  system 


The  computers  comprising  the  system  are: 

(1) a DEC-1060 computer (36-bit word, 512K memory, 1 µs cycle time);

(2) 12 PDP-11/45 computers (16-bit word, 32K or 64K of private core, 900 ns
cycle time; 4K or 8K private MOS, 450 ns cycle time);

(3) a Varian V-73 computer (16-bit word, 64K MOS memory, 300 ns cycle time);

(4) a MODCOMP II computer (16-bit word, 64K memory, 800 ns cycle time);

(5) 4 shared 4-ported core memories of 32K each.

The 12 PDP-11/45s are configured into four stems. The "master" processor on each stem is
connected  to  its  satellites  by  a  shared  core  memory  of  32K.     The  masters  are  connected  with 
each  other  by  an  additional  32K  of  shared  core  memory. 
Stem 1 consists of three 11/45s and controls the RMS and communications between the MCS
and  the  players.     One  11/45  may  be  used  to  support  special  instrumentation  requirements 
or  held  in  reserve  as  a  spare;   a  second  executes  the  RMS  player  polling  and  communications 
control  algorithms;  while  the  third  provides  interrogator  unit  selection,   directs  items 
in  the  incoming  data  stream  to  their  proper  destinations,   and  multiplexes  outgoing  data 
items  into  a  single  data  stream. 

Stem 2 consists of four 11/45s and performs the position calculations for all players
using the Kalman filter algorithms.

Stem 3 consists of four 11/45s and may be used to perform the SAM casualty assessment
simulations when required.

Stem 4 consists of a single 11/45 and acts as the interface between the other 11/45s,
the DEC-10, and the VIDS large screen display system. Stem 4 is also connected to the stem
shared memory of Stem 1.

The  DEC-10  performs  casualty  assessments  for  ground-to-ground  and  air-to-ground  weapons 
and  serves  as  the  instrumentation  system  monitoring  and  control  center. 

The  V-73  acts  as  a  controller  and  buffer  between  the  RMS  and  the  MCS.  It  is  connected 
directly to the Range Measurement System control hardware. The V-73 is connected to Stem
1 through a microwave relay operating at 230.4 kb/s.

The  MODCOMP-II  is  dedicated  to  the  VIDS  system.     The  VIDS  consists  of  an  Advent  large 
screen  color  display  interfaced  to  an  Aydin  color  graphics  controller.     The  system  also 
supports  several  Aydin  color  graphics  terminals  and  a  map  digitization  system. 

The  disks  shown  are  RP06  disks  with  a  176  MB  word  capacity  and  an  average  access  time  of 
36ms.     The  disks  log  all  raw  data,   all  position  calculation  results  and  all  actions  taken 
by  the  SAM  models.     Not  shown  are  two  RP06  drives  and  eight  40MB  RP03  drives  used  by  the 
DEC-10  for  casualty  assessment  simulation  logs,   initialization  and  operator  data  logs,  and 
instrumentation  system  performance  logs.     Also  not  shown  are  the  2.5MB  RK05  disks  attached 
to  each  master,   used  for  program  development  and  real-time  initialization. 

The master PDP-11/45s use the RSX-11M operating system, with RSX-11S used on the
satellites. Both versions have been stripped to their minimum size. RSX-11M requires
less than 9K and RSX-11S requires only 7K. Applications code is written in FORTRAN and
MACRO-11. The DEC-10 uses TOPS-10 and a tailored application of the Common Communications
Area and High Segment Common region. The DEC-10 applications code is in FORTRAN.

USACDEC  network  structure 

The network structure consists of four levels: a user or applications software level, a
logical link control level, a physical link control level, and the hardware level.

User  level 

At  user  level,   user  tasks  generate  messages  for  transmission  to  other  tasks  and/or 
receive  messages  from  other  tasks.     The  user  task  is  not  aware  of  the  network,   the  physical 
location  of  the  transmitting/receiving  process,   or  the  actual  routing  of  the  message. 
The  data  type  contained  within  the  message  header  serves  as  an  implicit  message  address. 

The  user  task  communicates  with  other  tasks  through  the  logical  link  control  level  by 
invoking  some  combination  of  the  following  three  functions:     attach/detach  from  a  logical 
link,   transmit/receive  data  over  a  logical  link,   and  interrupt  a  process.     A  library  of 
standard  macro  definitions  and  user-callable  subroutines  is  utilized  to  facilitate 
applications  task  access  to  the  network. 

Logical  link  control  level 

From  a  practical  standpoint,   the  network  functions  at  the  logical  link  control  level. 
The  logical  link  control  software  maintains  data  synchronization,   assures  data  integrity, 
multiplexes  data  into  the  network,   and  controls  data  flow  through  the  logical  links  as  well 
as  user  task  access.     The  software  at  the  logical  link  level  is  referred  to  collectively  as 
the queue management software and is used primarily in support of PDP-11/45 interprocessor
communications.

The  queue  management  software  supports  the  following  functions: 
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(1)  File  manipulation 

(2)  File  transfer  between  any  file  structured  devices  within  the  network 

(3)  File  to  queue  operations 

(4)  Queue  to  file  operations 

(5)  Intertask/Interprocessor  communications 

(6)  Terminal  communications,   permitting  any  terminal  within  the  network  to 
communicate  with  any  operating  process  within  the  network 

The USACDEC environment imposes certain somewhat unique requirements on the MCS.
Chief among these are the following:

(1)  Each  new  experiment  has  radically  divergent  data  collection 
and  processing  requirements  in  terms  of  data  definitions, 
speeds,   and  effects  of  specific  player  actions. 

(2)  In  order  to  provide  graceful  degradation,   the  ability  to  move 
tasks  between  processors    (within  hardware  limitations)  without 
reprogramming  is  necessary. 

(3) The lack of satellite peripheral devices requires support for
generalized down-loading of the satellite processors through the
network.

As  a  result,   the  logical  links  within  the  network  are  viewed  as  data  paths  rather  than 
interconnections  between  nodes.     Instead  of  connecting  to  a  node  of  the  network,   a  user 
process  desiring  a  specific  data  type  connects  to  the  data  path  supporting  that  data  type. 
The  data  path  may  serve  as  either  an  input  point,  an  output  point,  or  both.     As  new  data 
types  and  their  applications  are  added  to  the  network,   user  tasks  specify  the  name  of  the 
data  path  where  a  particular  data  type  will  occur  and  the  network  software  handles  the 
actual data transmission. This approach permits running only those parts of the system
that are needed for a given experiment and makes it easy to interchange tasks between
processors. As a result, flexibility for hardware fallback and in the scheduling of computer
time has been greatly increased.
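
The idea of attaching to a data path by data type, instead of addressing a node, resembles what would now be called publish/subscribe. The sketch below is a minimal single-process illustration; the class name `DataPathNetwork`, its methods, and the data type names are hypothetical and are not the USACDEC macros or subroutines.

```python
from collections import defaultdict, deque

class DataPathNetwork:
    """Routes messages by data type; senders never name a receiver."""

    def __init__(self):
        self._paths = defaultdict(list)    # data type -> attached inboxes

    def attach(self, data_type):
        """Attach to the data path for one data type; returns an inbox."""
        inbox = deque()
        self._paths[data_type].append(inbox)
        return inbox

    def detach(self, data_type, inbox):
        # Compare by identity: two inboxes with equal contents are distinct.
        self._paths[data_type] = [q for q in self._paths[data_type]
                                  if q is not inbox]

    def transmit(self, data_type, payload):
        """Deliver to every attached task; returns the number of receivers."""
        receivers = self._paths.get(data_type, [])
        for inbox in receivers:
            inbox.append((data_type, payload))   # data type acts as the address
        return len(receivers)

net = DataPathNetwork()
display = net.attach("POSITION")
logger = net.attach("POSITION")
delivered = net.transmit("POSITION", {"player": 17, "x": 300.0, "y": 400.0})
```

Because a sender only names the data type, a receiving task can be moved to another processor, or left out of an experiment entirely, without any change to the sender.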

Physical  link  control  level 

Operational  network  supervision  is  provided  at  the  physical  link  control  level.  These 
routines  serve  as  the  interfaces  between  the  vendor  supplied  and  CDEC  developed  device 
drivers, the queue management software, and, in some cases, user tasks. Among the functions
performed at this level are message routing, message segmenting and blocking, and message
logging.

Hardware  level 

The  hardware  level  transmits  and  receives  data  over  the  physical  links.     This  level's 
software  is  implemented  by  both  DEC  supplied  and  BDM  developed  device  drivers  and  interrupt 
service  routines. 

Logical  link  control  software  requirements 

Extensive  system  support  is  required  to  allow  concurrent  applications  software  execution 
using  shared  data  bases  among  the  tightly  coupled  multiple  processors  of  the  network  as 
well  as  for  effective  multitasking  within  each  computer.     The  RSX-11  operating  system 
effectively  supports  the  execution  of  independent  tasks  in  a  single  processor,   and  permits 
the  sharing  of  program  libraries  and  data  among  such  tasks.     RSX-11  does  not,  however, 
provide a suitably efficient communications mechanism capable of interprocessor communications
initiation, data transmission, and termination. In addition, concurrent data
processing  of  the  same  datum  by  multiple  tasks  is  required.     It  is  mandatory  that  shared 
data  not  be  overwritten  or  lost  during  or  prior  to  its  use    (although  in  some  cases, 
certain  data  may  be  "thrown  away"   in  a  controlled  fashion) . 

In an environment where multiple concurrent processes have access to
the same data and memory space, it is necessary to assure data base integrity. Inconsistent
concurrent applications software speeds mean that care must be taken that all intended
receivers have fully accessed a data element prior to its overwrite or deletion.

Differences  between  data  flow  rates  required  for  casualty  assessment  simulations 
operating  on  different  processors,   variability  in  actual  processing  time  required  for 
a  particular  data  element,   and  independence  of  CPU  processes  can  combine  to  cause  queue 
wrap-around  situations.     The  resulting  errors  in  consistency  and  data  losses  depend  on 
both  the  data  mix  at  a  given  moment  and  the  relative  and  absolute  software  and  hardware 
timing.     In  general,   these  errors,   if  they  occur,   are  neither  reproducible  nor,   in  most 
cases,   even  detectable.     The  possibility  of  uncontrolled  and  undetected  data  loss  must  be 
demonstrably  absent. 
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The  queue  management  software 


Queues  provide  the  primary  mechanism  for  transferring  data  among  tasks  and  for  buffering 
and  maintaining  the  data  until  it  is  no  longer  needed.     The  data  used  by  the  queue 
management  software  to  coordinate  internal  data  communications  uses  the  same  queues 
and queue handling software as all other data. Specially tuned queues provide the interface
between users, real-time I/O devices, and between the PDP-11/45s as a group and the
DEC-10. Applications and system software may send data to one or more of these devices, as
well  as  to  other  software,  by  placing  the  data  in  a  queue  which  has  been  initialized  for 
the  purpose  of  transmission  of  that  data.     Data  received  from  these  devices  is  placed  into 
a  queue  by  the  queue  management  software  for  subsequent  access  by  the  using  task. 

The  queue  management  software  supports  multiple  storage  tasks  and  multiple  retrieval 
tasks  accessing  a  single  queue  simultaneously.     A  mechanism  is  available  for  causing  user 
tasks  to  "sleep"  when  their  input  queues  are  empty,   and  for  "reawakening"  them  when  new 
data  arrives  in  the  appropriate  queues. 

The  queue  management  software  provides  a  copy  capability  which  permits  tasks  which  do 
not  share  memory  directly  to  communicate  through  the  queue  system.     In  addition,   two  or 
more identical copies of any queue may be maintained in the same physical memory
simultaneously.

Coordination  between  tasks  is  performed  by  having  each  task  test  its  input  queues  for 
available  work.  Other  tasks,  having  access  to  the  same  memory  space,  provide  data  to  be 
processed.  Since  both  the  queue  data  and  queue  pointers  are  available  to  the  source  and 
destination  tasks,  the  source  task  can  suspend  processing  and  allow  the  destination  task 
to  "catch  up"  when  the  shared  queue  buffer  space  begins  approaching  an  overflow  condition. 

The  data  locking  mechanism 

In  order  to  avert  queue  wrap-around  errors,  handling  of  all  shared  data  structures  is 
coordinated  by  a  protocol  of  read  and  write  locks.     A  lock  word  is  added  to  each  data 
element  in  a  queue  to  specify  whether  a  user  is  writing  into  the  block  or  whether  one  or 
more  users  are  reading  data  from  the  block.     The  lock  protocol  requires  each  user  to  ensure 
that  elements  are  used  only  according  to  the  access  authorization  which  they  request;  i.e., 
user  software  must  both  set  and  release  its  own  locks  and  respect  the  locks  set  by  other 
tasks.     While  this  places  some  burden  on  the  user,   it  also  allows  greater  flexibility  in 
user  actions. 

Queue  wrap-around  can  arise  in  a  circular  queue  when  the  data  storage  process,   in  moving 
around  the   "circle",  overtakes  a  data  retrieval  process.     Locking  provides  a  consistent 
method  for  detecting  and  controlling  data  loss  due  to  queue  overflow.     Optionally,  either 
the  oldest  data  may  be  overwritten  or  new  incoming  data  may  be  ignored  and/or  overwritten. 
In any event, a data block which is currently being processed (i.e., read locked) cannot
be  modified  by  a  storage  task.     As  a  result,   the  ongoing  processing  of  the  read  locked 
data  element  is  not  aborted,   and  a  straightforward  recovery  procedure  for  the  next  data 
element  can  be  accomplished.     An  error  indication  is  provided  to  all  concerned  tasks  when 
the  overflow  condition  is  sensed.     This  indication  permits  a  reading  process  to  compensate 
to  avoid  being  overtaken  again.     This  procedure  allows  tuning  to  control  the  amount  and 
type of data lost, while retaining considerable flexibility in permissible recovery
procedures.     In  addition,   it  allows  notification  of  operators  that  data  has  been  lost  and 
permits  measures  to  be  undertaken  to  compensate. 
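
A minimal sketch of a circular queue with a lock word per element may help make the wrap-around handling concrete. It is a single-threaded illustration under stated assumptions: the class name `LockedRing`, the lock word encoding (0 free, -1 write locked, n > 0 for n readers), and the `overwrite_oldest` switch are hypothetical, and a real shared-memory version would need an indivisible test-and-set on the lock word.

```python
FREE, WRITE_LOCKED = 0, -1    # lock word values; n > 0 means n active readers

class LockedRing:
    def __init__(self, size, overwrite_oldest=False):
        self.data = [None] * size
        self.lock = [FREE] * size
        self.head = 0              # next slot to be written
        self.tail = 0              # oldest stored element
        self.count = 0
        self.overwrite_oldest = overwrite_oldest
        self.overflowed = False    # error indication visible to all tasks

    def store(self, item):
        size = len(self.data)
        if self.count == size:                  # writer has caught the reader
            self.overflowed = True
            if not self.overwrite_oldest or self.lock[self.tail] > 0:
                return False                    # new data ignored; locked block untouched
            self.tail = (self.tail + 1) % size  # give up the oldest element
            self.count -= 1
        slot = self.head
        self.lock[slot] = WRITE_LOCKED
        self.data[slot] = item
        self.lock[slot] = FREE
        self.head = (slot + 1) % size
        self.count += 1
        return True

    def begin_read(self):
        """Read-lock the oldest element and return its slot, or None if empty."""
        if self.count == 0 or self.lock[self.tail] == WRITE_LOCKED:
            return None
        self.lock[self.tail] += 1
        return self.tail

    def end_read(self, slot):
        """Release a read lock; the slot is freed once no reader holds it."""
        self.lock[slot] -= 1
        if self.lock[slot] == FREE and slot == self.tail:
            self.tail = (self.tail + 1) % len(self.data)
            self.count -= 1
```

In both policies a read-locked element is never modified, so the reader can finish its current element and then consult `overflowed` to decide how to catch up.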

There  exist  some  cases  where  a  single  shared  data  area  has  multiple  readers,   either  on 
a  single  processor  or  on  multiple  processors.     The  lock  protocol  provides  a  method  for 
assuring  that  all  intended  readers  have  retrieved  a  data  element  prior  to  the  release  of 
its  memory  area.     This  technique  is  especially  useful  in  those  instances  where  space  is 
allocated  from  memory  pools  in  a  multithread  queue  address  space. 
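
One simple way to picture the multiple-reader case is to keep, for each element, the set of intended readers that have not yet retrieved it, and to return the memory block to the pool only when that set is empty. The sketch below assumes this bookkeeping; the paper does not describe the actual lock word layout, and the names `SharedPool`, `publish`, and `retrieve` are hypothetical.

```python
class SharedPool:
    """Fixed pool of blocks; a block is freed after all intended readers read it."""

    def __init__(self, n_blocks):
        self.free_blocks = list(range(n_blocks))
        self.payload = {}     # block number -> stored data element
        self.pending = {}     # block number -> readers still to retrieve it

    def publish(self, payload, readers):
        """Store one element for the named readers; None if the pool is full."""
        if not self.free_blocks:
            return None
        block = self.free_blocks.pop(0)
        self.payload[block] = payload
        self.pending[block] = set(readers)
        return block

    def retrieve(self, block, reader):
        """Return the element; release the block after the last intended reader."""
        data = self.payload[block]
        self.pending[block].discard(reader)
        if not self.pending[block]:
            del self.payload[block], self.pending[block]
            self.free_blocks.append(block)
        return data

pool = SharedPool(1)
block = pool.publish("range report", ["kalman", "display"])
first = pool.retrieve(block, "kalman")    # block is still held for "display"
```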

The lock protocol is implemented within the queue management software. The implementation
includes facilities for entering a wait state when a requested lock is not available,
and  for  recovering  when  the  lock  condition  changes.     The  lock  protocol  is  implemented 
through  standardized  macros  and  FORTRAN  calls  to  assure  consistency  of  use  and  interface. 

Queue  types 

The  queue  management  software  supports  two  types  of  queues,  dedicated  queues  and  general 
queues.     Table  1  compares  the  two  types. 


SPIE Vol. 341 Real Time Signal Processing V (1982) / 293


Table 1.  A Comparison of Dedicated and General Queues

Attribute                        General Queue              Dedicated Queue
-------------------------------  -------------------------  ----------------------------
Relative Overhead                High                       Low
Element Size                     Variable                   Fixed
Number of Elements Allowed       Variable                   Fixed at Initialization Call
Element Location                 First Fit                  Contiguous
Garbage Collection               Contiguous Elements are    Not Required
                                 Joined in Free Space
Data Control Structures          Created at Assembly Time   Created at Run Time
Error Routine                    User Specified             Fixed Routine
Wait Routine                     User Specified             Fixed Routine or User
                                                            Specified
Automatic Data Logging of        No                         Yes, at User Option
  Queue Elements
Data Structure Consistency       Yes                        Yes
  Checking
Automatic Queue Copying Across   Yes, at User Option        Yes, at User Option
  Processors


General  queues 

These  queues  have  attributes  which  are  similar  in  logical  appearance  to  sequential  files. 
The  elements  of  a  queue  are  accessed  on  a  first-in/first-out    (FIFO)    sequential  basis. 

In  order  to  maintain  a  high  level  of  efficiency,   the  general  queue  software  passes 
pointers,   rather  than  data,   to  the  invoking  or  invoked  task.     This  reduces  the  potential 
for  memory  contention  in  the  shared  memory  by  removing  the  requirement  that  data  actually 
be  moved  into  each  task's  local  address  space. 

Since  queue  areas  are  accessible  through  pointers,   there  is  no  necessity  for  those  areas 
to be contiguous. Since the areas need not be contiguous, multiple queues may be interleaved
within an allocated memory space. Individual queues may expand and shrink as their
traffic  volume  dictates. 
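
Table 1 lists first-fit element location and the joining of contiguous free elements for general queues. A generic sketch of that allocation style follows; it illustrates the technique in general and is not the USACDEC free space queue, and the names `FreeSpace`, `obtain`, and `give_back` are hypothetical.

```python
class FreeSpace:
    """First-fit allocator over one memory area, with merging of neighbours."""

    def __init__(self, size):
        self.free = [(0, size)]           # sorted list of (offset, length)

    def obtain(self, length):
        """Return the offset of the first free area that is large enough."""
        for i, (off, ln) in enumerate(self.free):
            if ln >= length:
                if ln == length:
                    del self.free[i]
                else:
                    self.free[i] = (off + length, ln - length)
                return off
        return None                       # no single area is large enough

    def give_back(self, off, length):
        """Return an area and join it with any adjacent free areas."""
        self.free.append((off, length))
        self.free.sort()
        merged = [self.free[0]]
        for o, l in self.free[1:]:
            prev_off, prev_len = merged[-1]
            if prev_off + prev_len == o:  # contiguous with the previous area
                merged[-1] = (prev_off, prev_len + l)
            else:
                merged.append((o, l))
        self.free = merged

space = FreeSpace(100)
a = space.obtain(30)
b = space.obtain(30)
c = space.obtain(30)
space.give_back(a, 30)
space.give_back(b, 30)                    # joins with the area freed by a
```

Without the joining step, the two returned 30-word areas could not satisfy a later request for 40 words even though 60 contiguous words are free.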


Central to the operation of the general queues is the Queue Interface Block (QIB). The QIB
holds all data necessary for the queue control software to access a user's local
address space. The user creates a QIB for each general queue with the following information:


(1) Identification of the queue to be accessed.

(2)  Identification  of  an  associated  free  space  queue. 

(3)  A  fatal  error  subroutine  address. 

(4)  A  wait  subroutine  address. 

(5)  The  address  of  a  subroutine  to  alert  other  tasks. 

These  functions  are  performed  by  a  standard  package  of  subroutines  called  by  the  using 
task.     All  subroutine  invocations  receive  the  location  of  the  QIB  on  entrance.     On  exit,  a 
flag  indicates  successful  completion.     If  the  function  has  not  been  completed,   a  code 
indicating  the  cause  will  be  returned.     A  special  exit  is  provided  for  fatal  errors  to 
prevent  abnormal  task  termination  and  to  allow  error  recovery  processing. 
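
The QIB contents and the calling convention described above can be pictured as a small record passed to every queue subroutine. The sketch below is an assumption-laden illustration: the field names, the completion codes, and the `store_entry` function are hypothetical, and the original was written in MACRO-11 and FORTRAN rather than Python.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

OK, NO_SPACE, FATAL = 0, 1, 2          # hypothetical completion codes

@dataclass
class QIB:
    queue_id: str                       # (1) queue to be accessed
    free_space_id: str                  # (2) associated free space queue
    fatal_error: Callable[[str], None]  # (3) fatal error subroutine
    wait: Callable[[], None]            # (4) wait subroutine
    alert: Callable[[], None]           # (5) subroutine to alert other tasks

def store_entry(qib, queues, free_space, item):
    """Store one entry; returns (success flag, completion code)."""
    if qib.queue_id not in queues:
        # Special exit for fatal errors: call the user's routine instead of
        # terminating the task, so that recovery processing can run.
        qib.fatal_error("unknown queue " + qib.queue_id)
        return False, FATAL
    if free_space[qib.free_space_id] <= 0:
        return False, NO_SPACE          # not completed; code gives the cause
    free_space[qib.free_space_id] -= 1
    queues[qib.queue_id].append(item)
    qib.alert()                         # let sleeping readers know
    return True, OK

errors, alerts = [], []
qib = QIB("POSITION", "POOL", errors.append, lambda: None,
          lambda: alerts.append("wake"))
queues = {"POSITION": deque()}
free_space = {"POOL": 1}
status = store_entry(qib, queues, free_space, "entry 1")
```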

Subroutines  are  provided  to  implement  the  following  general  queue  functions:  initialize 
QIB, obtain/return space, store/release an entry, and move to next queue entry. A mechanism
is also provided to cause a reading task to "sleep" if its input queue is empty, and
to  "awaken"   it  when  new  data  becomes  available. 

The normal processing sequence of a task using this mechanism is:

(1)  Sleep  and  await  the  indication  of  possible  work. 

(2) Upon being awakened, the task should clear all event flags
or other mechanisms used to awaken the task.

(3)  The  task  checks  its  input  queues  or  other  activities  for  work. 

(4)  If  work  is  found,   it  is  performed  and  Step  3  is  repeated. 

(5)  If  work  is  not  found,   the  task  returns  to  sleep  status. 

Awakened  tasks  are  not  informed  as  to  why  they  were  awakened.     Once  awake,   a  task  must 
check  its  own  status;   e.g.,   check  for  any  available  work.     If  a  task  polls  for  work  it 
must  use  a  mechanism  which  allows  other  tasks  to  run.     The  task  cannot  assume  knowledge 
of  the  current  processing  status  of  other  tasks. 
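
The five-step sequence can be sketched as a single wake cycle. This is an assumption-laden illustration: Python's `threading.Event` stands in for the event flag, `deque` objects stand in for input queues, and the function name is hypothetical. Clearing the flag before scanning (step 2) matters because a producer that posts work after the clear sets the flag again, so the wakeup is not lost.

```python
import threading
from collections import deque


def run_one_wake_cycle(wake_event, input_queues, handle, timeout=None):
    """One pass through steps 1-5; returns the number of items processed."""
    wake_event.wait(timeout)        # (1) sleep until signalled (or timed out)
    wake_event.clear()              # (2) clear the flag before looking for work
    processed = 0
    found = True
    while found:                    # (3)/(4) rescan until no queue holds work
        found = False
        for q in input_queues:
            if q:
                handle(q.popleft())
                processed += 1
                found = True
    return processed                # (5) the caller loops back to sleep
```

The task is not told why it was awakened; it simply checks every input queue, which matches the rule stated above.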

Each  QIB  serves  as  an  interface  for  a  specific  user  to  a  specific  queue.     In  use,  the 
QIB  is  used  to  move  to  another  entry.     Similarly,   all  other  QIBs  pointing  to  the  same  queue 
entry  must  be  moved  before  the  entry  can  be  released  back  to  the  available  memory  pool. 

Dedicated  queues 

Dedicated  queues  provide  fast  and  efficient  high  volume  data  paths.     Space  for  each 
queue  is  permanently  assigned,   and  is  subdivided  into  fixed  length  areas,   each  of  which 
may  contain  one  logical  element.     The  use  of  pointers  is  minimized  and  those  used  consist 
of  displacements  relative  to  the  queue  space  origin.     The  pointers  are  normally  stored 
separately from the data. This facilitates the use of block I/O devices, such as disks,
for transfer of data without requiring reference to the queue pointers. In general, the
queue processing software only manipulates the queue pointers. This is especially useful
in I/O and interrupt processing. The dedicated management software can be integrated with
the I/O driver software for block transfer devices, resulting in reduced overhead, although
at the expense of a somewhat inflexible storage allocation.

Each  dedicated  queue  user  uses  a  Queue  Interface  Array    (QIA)    (analogous  to  the  QIB)  for 
each  queue  used.     The  address  of  the  QIA  is  included  in  all  queue  management  calls  and  the 
block  is  initialized  and  maintained  by  the  queue  management  software.     The  QIA  contains 
both  access  mode  information  and  a  pointer  to  the  queue  header  which  contains  common  queue 
information  and  a  lock  word  for  each  element,   a  sequence  control  number  and  addressing. 
A data element consists of multiple data words in a contiguous area of data memory
located in another system common area of the same class as the header (stem or master
shared memory). Headers for all queues in a given class of memory are stored in a common
communication  area  with  other  interprocessor  communication  coordination  information. 

A  dedicated  queue  is  initialized  by  specifying:  queue  identity,  data  area  address,  data 
element  length,   element  count,  and  access  mode.     When  other  users  initialize  to  the  same 
queue,   their  initialization  data  is  checked  for  consistency.     The  user  may  request,  by 
the  access  mode  selected,   that  the  queue  management  software  provide  a  memory  management 
register to facilitate processor access to the data.

Each  dedicated  queue  storing  task  invokes  a  routine  to  request  the  next  available 
address  to  store  data  into,   performs  the  storage  operation,  and  then  invokes  a  routine 
signaling  that  the  element  has  been  filled  and  is  now  available  for  use.     Each  reading 
task  tests  for  data  availability  via  a  standard  routine  and  gains  access  to  the  queue  if 
data  is  available.     When  a  data  element  has  been  processed,   a  routine  is  invoked  to  release 
the  current  data,   test  for  more  available  data,   and  gain  access  to  that  element.     A  reader 
may  elect,   by  an  access  code,   to  suspend  processing  pending  data  availability,   or  to  be 
returned  an  error  indication  if  the  input  queue  is  empty. 
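
The store and read protocol for a dedicated queue can be sketched as a fixed-size ring of equal-length elements, with control information kept apart from the data and pointers expressed as displacements from the queue origin. The class and method names below are hypothetical, and the sketch assumes a single writer and a single reader; lock words and access modes are omitted.

```python
class DedicatedQueue:
    """Fixed-length elements in permanently assigned space (illustrative)."""

    def __init__(self, element_length, element_count):
        self.element_length = element_length
        self.element_count = element_count
        self.data = bytearray(element_length * element_count)  # data area
        self.filled = [False] * element_count   # control info, kept separately
        self.write_index = 0
        self.read_index = 0

    def next_store_offset(self):
        """Displacement of the next free element, or None if the queue is full."""
        if self.filled[self.write_index]:
            return None
        return self.write_index * self.element_length

    def mark_filled(self):
        """Signal that the element just written is available to readers."""
        self.filled[self.write_index] = True
        self.write_index = (self.write_index + 1) % self.element_count

    def current_read_offset(self):
        """Displacement of the next readable element, or None if empty."""
        if not self.filled[self.read_index]:
            return None
        return self.read_index * self.element_length

    def release(self):
        """Release the current element, then test for and access the next one."""
        self.filled[self.read_index] = False
        self.read_index = (self.read_index + 1) % self.element_count
        return self.current_read_offset()
```

Because the queue routines only touch `filled` and the two indices, the data area could be handed to a block transfer device unchanged, which is the property the text emphasizes.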

Conclusion 

The  queue  management  software  has  been  successfully  utilized  now  for  over  three  years 
on  a  wide  variety  of  different  experiments.     As  a  result  of  timing  and  usage  studies,  the 
general  queue  methodology  is  now  the  preferred  method  of  network  communications.  The 
general  queues  have  been  found  to  have  sufficient  speed  and  capacity  for  all  but  the  most 
demanding  experimentation  conditions.     In  addition,   they  allow  a  greater  degree  of  control 
over  task  access  to  the  queues  as  well  as  facilitating  error  detection  and  recovery 
processing.     The  dedicated  queue  approach  is  kept  in  reserve  for  those  experiment  designs 
requiring  very  high  data  volumes  and  rates. 

Abstract

A fault tolerant design used to enhance the survivability of a distributive
processing  system  is  described.  Based  on  physical  limitations,  mission  duration  and 
maintenance  support,  the  approach  has  emphasized  functional  redundancy  in  place  of  the 
traditional  hardware  or  software  level  redundancy.  A  top  down  architecture  within  the 
system's  hierarchy  allows  sharing  of  common  resources.  Various  techniques  used  to 
enhance  the  survivability  of  the  hardware  at  the  equipment,  module  and  component  level 
were analyzed. The intent of the ongoing work is to demonstrate the ability of a
distributive  processing  system  to  maintain  itself  for  a  long  period  of  time. 

Introduction 

Marine  Corps  communication  systems  have  a  strong  need  for  increased  survivability  and 
simplified  maintenance.  Survivability  can  easily  be  measured  during  a  field  operation, 
where the effectiveness of Command and Control (C²) is tested. During the operation,
the performance of the C² system is heavily dependent on data communication.
Information, in the form of data from sensors and commands to field elements, is
continuously being transmitted to and from various units. When being routed from source
to destination, this information may pass through several trunk lines linked together by
switching nodes within a network. Loss of a trunk line due to a failure in a switch
would  severely  cripple  the  performance  of  the  network  and  in  the  process  cut  off 
communication  to  field  resources.  In  addition,  with  the  limited  manpower  available  in 
the  field,  it  is  a  difficult  task  to  repair  the  unit  and  return  it  to  operation.  System 
designers  are  facing  the  problems  of  (1)  how  to  make  sensitive  electrical  systems 
operate  in  a  hostile  environment  and  (2)  how  to  make  these  systems  tolerate  failure  in 
the  field. 

A  solution  to  these  problems  is  the  goal  of  this  fault  tolerant  research  and 
development  program.     The  basic  objectives  of  the  program  are  twofold: 

1. Demonstrate that electronic systems which incorporate fault tolerant design
techniques can greatly improve operational survivability.

2. Demonstrate that electronic systems can be built which can reduce technical
and logistic support at the tactical level.

The  present  trend  is  towards  using  microcomputers  in  the  design  of  complex  and 
sophisticated  military  systems.  The  primary  effort  of  this  program  concentrates  on 
developing  fault  tolerant  techniques  which  can  be  incorporated  in  these  systems.  In 
this  manner  the  Marine  Corps  focuses  on  developing  technology  which  can  be  used  in  a 
multitude of different applications. Although it is feasible to support development of
a fault tolerant capability in other types of systems (e.g., analog systems), general
cost, benefits and other special concerns may not offset the increase in resources
needed to achieve these objectives.

In  developing  a  fault  tolerant  system,  improved  operational  availability  is  achieved 
in  the  design  methodology.  Errors  in  a  system  are  detected  and  corrected  by  either  a 
hardware  or  software  algorithm.  Fault  Tolerance  also  involves  elimination  of  design 
faults  in  the  operational  system.  For  the  purpose  of  this  document,  these  design  faults 
are  assumed  to  have  been  eliminated.  Attention  will  focus  on  tolerating  three  types  of 
faults:  1)  Permanent  component  failure  (hard  error);  2)  Intermittent  component  failure 
(soft  error);  3)  External  interference. 

Approach 

The  approach  entailed  building  a  demonstration  system  for  collecting  data  to  evaluate 
system  performance  enhancements  achieved  by  fault  tolerant  techniques.  The  operational 
requirements  for  the  demonstration  system  were  taken  from  the  specification  of  a  Marine 
Corps  mobile  store  and  forward  message  switch.  The  following  fault  tolerant 
enhancements were added to this specification:

1. Improved Survivability (carry out the mission of the switch for one hundred and
twenty days, an improvement over current requirements by a factor of ten).

2. Maintain Transportability (considering the available resources during a field
operation, minimize increases in weight, size and power, which are often the price of a
fail safe design).

Data  will  be  collected  and  analyzed  from  the  operation  of  this  system  test  bed  for  the 
purpose of:

1. Determining the success of maintaining throughput given various levels of system
degradation.

2. Investigating the feasibility of minimizing operator intervention in system
maintenance through embedded diagnostics and improved man-machine interface.

3.  Providing  information  on  the  system  requirements  for  individual  software 
performance  modules,  associated  hardware  requirements  and  software  development  cost 
associated  with  the  development  of  a  fault  tolerant  system. 

System  design 

A  fault  tolerant  system  means  that  the  system  has  the  ability  to  execute  specific 
algorithms  correctly  regardless  of  hardware  or  program  failures.  To  include  this 
feature,  a  system  designer  provides  parallel  paths  for  critical  functions  either  in  the 
form of redundancy or load sharing. In search of the best design to satisfy the Marine
Corps  fault  tolerant  goals  and  objectives,  the  following  techniques,  as  summarized 
below,   were  considered. 

Hardware  duplication 

Here, the system function is safeguarded by maintaining two identical pieces of
equipment: one is active, the other an off-line back up. The disadvantages of this
method are (1) increased weight and size (i.e., two pieces of hardware are required to
do the job of one) and (2) reduced operational time (the system remains down until the
back up is powered up and switched on-line).

Line  redundancy 

Line redundancy is a technique for interconnecting critical functions within a system.
This method involves converting parallel lines to serial as a means of providing alternate
communications paths. With this technique, maintaining system throughput requires a
significant increase in bandwidth. The additional bandwidth is the cost of converting
address, data and control lines to a serial format.

Majority  voting 

This  is  a  classical  method  used  for  detecting  and  correcting  processor  errors  on  the 
fly. The process requires the members of an ensemble to compare results simultaneously to
determine correctness. Although this is a powerful aid in assuring reliable calculation in
real time, the advantage is offset by the additional hardware and software required to form a
majority vote among the members of the ensemble.
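
As a point of reference, the voting step itself is simple; the cost noted above lies in replicating the processors that feed it. A minimal sketch in Python, with a hypothetical function name and assuming each member's result is a hashable value and that at least one result is supplied:

```python
from collections import Counter


def majority_vote(results):
    """Return (value, dissenters) when a strict majority agrees.

    If no value is held by more than half of the members, return
    (None, all member indices), since no member can be trusted.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        return None, list(range(len(results)))
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters
```

With three members, one faulty result is outvoted and its index is reported, for example `majority_vote([7, 7, 9])` gives `(7, [2])`.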

Functional  redundancy 

This  technique  takes  into  account  the  ability  of  a  general  purpose  microcomputer  to 
function  as  a  programmable  logic  machine.  This  property  of  a  microcomputer,  when 
developed  in  a  distributive  processing  architecture,  provides  the  system  with  an 
inherent  form  of  redundancy.  In  a  functional  approach,  redundancy  is  achieved  by  being 
able  to  reconfigure  the  resources  at  different  levels  within  the  system.  This  involves 
a  top  down  design  architecture  which  allows  the  sharing  of  common  resources  within  the 
following  levels: 

a.  System  (group  of  equipments,  including  any  required  operator  functions,  which  are 
integrated  to  perform  a  related  operation). 

b.  Equipment  (a  unit  which  performs  an  operational  function  and  is  capable  of 
independent  operation  in  a  variety  of  situations). 

c.  Module  (a  combination  of  components  which  has  limited  application  or  use  on  its 
own  but  is  essential  for  the  completeness  of  the  more  complex  item). 

d. Component (the smallest electronic device in the system, e.g., integrated
circuits, with terminals that may be directly connected to other electronic
devices).

The  actual  hardware  or  software  redundancy  is  distributed  throughout  the  system 
instead of localized at a single level. Since this method only requires improving the
operational  efficiency  of  a  microcomputer  system  as  the  means  of  achieving  improved 
survivability,  functional  redundancy  was  selected  as  the  best  candidate  for 
demonstrating  fault  tolerant  enhancements. 

System  configuration 

The  primary  mission  of  the  message  switch  is  to  furnish  continuous  communication 
service to designated subscribers (users) within a network. Messages are input to the
switch and routed to their destination. This process involves three distinct
operations: inputting and outputting messages (Line Terminator Unit), processing the
message (Processor) and temporary storage for bookkeeping (Memory). Table 1 summarizes
the  various  tasks  involved  with  transmitting  a  message  through  the  network  and  the 
associated  distribution  of  these  tasks. 

TABLE 1
TASK DISTRIBUTION

INPUT/OUTPUT                      MEMORY

MESSAGES ENTER/EXIT SYSTEM        TEMPORARY STORAGE
RECOGNIZES MESSAGE FRAMES         PROGRAM STORAGE
CHECKS FOR TRANSMISSION ERRORS

INPUT PROCESSOR                   CONTROL PROCESSOR            OUTPUT PROCESSOR

CONTROLS RECEPTION OF MESSAGES    DYNAMIC BUFFER ALLOCATION    PREPARES MESSAGE FOR OUTPUT
CHECKS FOR FORMAT ERRORS          MESSAGE LOGGING              INITIATES TRANSMISSION
                                  MESSAGE ROUTING              MONITORS SENDING PROCESS
                                  MESSAGE QUEUES


The  goal  of  a  fault  tolerant  design  is  to  avoid  those  conditions  which  result  in  a 
mission  failure.  This  is  achieved  within  the  network  (System)  by  providing  redundant 
circuits  for  critical  paths  and,  when  applicable,  load  sharing  of  available  resources. 
In  the  functional  approach  a  trade  off  is  performed  between  what  redundancy  is  needed 
and  the  additional  resources  offered  at  various  levels  within  the  system.  The  scope  of 
the ongoing work is limited to those techniques applicable at the local level
(equipment). However, in the case of the Line Terminator Unit (LTU), which forms the
boundary between local and network elements, the assets offered by the network cannot be
completely ignored. As an example, to patch around a line failure at a switch, the
network would route a message on an alternate path. This involves either developing or
improving a high level interface to accommodate (1) a distributive network supervisor,
(2) load sharing and reconfiguration of trunk lines, and (3) monitoring and status
reporting within the network. This, in turn, would relieve the switch of the hardware
requirement to include a one for one back up for all the access lines. Taking
advantage of a software interface saves the switch the additional weight, size and power
necessary to support additional lines.

As  mentioned  previously,  the  primary  mission  of  the  switch  is  to  provide  continuous 
communication service to at least three local users of the network. Therefore, based on
an elementary network configuration, required maintenance support, estimated system
throughput, and loading of trunk lines, each switch at a remote site is required to
sustain at least six of the specified twelve access lines without operator
intervention. The six lines allow the switch to accommodate three user terminals and
three  trunk  lines  for  message  routing.  This  provides  the  guideline  for  defining  a 
successful  mission.  The  switch  will  be  considered  to  have  failed  when  it  fails  to 
electrically  reconfigure  itself  to  maintain  six  access  lines.  With  the  mission 
objective  (switch),  duration  (120  days)  and  success  having  been  defined,  the  following 
describes  the  techniques  used  to  improve  the  survivability  of  the  switch. 

Switch  architecture 

The  switch  architecture  is  illustrated  in  Figure  1.  A  dual  bus  structure  was 
utilized  for  satisfying  the  internal  data  exchange  requirements.  The  system  bus 
provides  the  means  of  resource  sharing  between  modules  (loosely  coupled  system),  while 
the  fault  tolerant  bus  operates  independently  and  provides  the  communication  channel 
necessary  to  monitor  and  manage  the  equipment's  resources.  By  keeping  the  fault 
tolerant  functions  separate,  system  throughput  is  not  sacrificed  and  existing  systems 
can  be  modified  to  include  fault  tolerant  capabilities. 


[Figure 1. Switch architecture. Blocks shown: operator console; dedicated access
channels (12); line termination unit; system data bus; system memory; processors 1
through 4; fault tolerant/monitor bus.]


Equipment  techniques 

Bus 

The Intel Multibus structure was selected for the system bus protocol. In this bus
architecture, bus contention is resolved on a priority basis. The priority of a module
is determined by the interconnection method used in the back plane. Presently there are
three common methods for interconnecting the priority circuitry (daisy chain, random,
and parallel). Of these three, the parallel configuration with a hardware
time out was selected as the most fail safe. In the daisy chain method a failure in any
link electrically removes all Masters (modules which have the capability of controlling
data exchanges on the Bus) with lower priority from the Bus. The random select method
requires additional circuitry, which adds points of failure.
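
The difference in failure behavior between the daisy chain and parallel methods can be illustrated with a toy model. The function names are hypothetical, and the parallel model (one independent line per Master into a central resolver, so that a single line failure removes only that Master) is an assumption made for illustration; the text does not describe the parallel circuit in detail.

```python
def masters_on_bus_daisy_chain(n_masters, failed_link=None):
    """Masters are numbered 0 (highest priority) to n_masters - 1.

    Link k carries the priority signal into Master k, so a break at
    link k cuts off Master k and every lower-priority Master behind it.
    """
    if failed_link is None:
        return list(range(n_masters))
    return list(range(failed_link))


def masters_on_bus_parallel(n_masters, failed_line=None):
    """Assumed model: a failed line removes only the Master it serves."""
    return [m for m in range(n_masters) if m != failed_line]
```

In this model a single break near the head of a daisy chain removes most of the Masters, while the same fault in the parallel arrangement costs one.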

The fault tolerant bus consists of spare data lines, spare address lines, module
identification, I/O Bus, and a serial communication channel. The system's status and
control information exchanged between modules is in the MIL-STD-1553B serial format over
this communication channel. This prevents fault tolerant signalling from delaying
data exchanges over the system bus. The fault tolerant control module acts as the
serial bus controller with all the other modules connected to the bus acting as remote
terminals.

Controller 

The fault tolerant control module functions consist of software driven error and
fault identification, isolation, connection, and system recovery methods coordinated
with hardware fault tolerant procedures designed to keep the switch operational,
regardless of the hardware state. This controller acknowledges an error or fault if the
error or fault is handled solely on an operational module. The controller initiates any
response that causes alterations of system memory or processor reconfiguration. This
ensures minimal disturbance to the message traffic flow while the system adjusts to
compensate for the failure.

The  controller,  while  monitoring  the  system,  performs  the  necessary  fault 
verification, isolation and recovery to maintain the operation. Fault verification
occurs when one of the processor modules suspects a fault condition. The controller
attempts to verify a fault or error, adjust the system to eliminate the suspected fault,
and isolate its cause. After accomplishing this, the controller performs the
procedures required to ensure non-recurrence of the fault condition.

Module  techniques 

Input/output 

From  the  definition  of  mission  success,  there  are  essentially  two  considerations  in 
the  I/O  design.  The  first  is  to  eliminate  any  failure  condition  which  would  result  in 
losing  more  than  six  of  the  system's  twelve  available  access  lines  and  the  second  is  to 
preserve  the  local  user  lines  to  the  system.  Any  component  failure  in  the  I/O  could 
produce either of two possible results: (1) the system loses a complete module or (2) it
loses  a  single  access  line.  The  design  goal  is  to  minimize  the  adverse  effect  on 
performance  of  losing  up  to  six  lines  at  a  time  and  also  guarantee  that  the  system  never 
loses  an  access  line  to  one  of  the  subscribers. 

To fulfill these design goals, the external interface portion of the twelve access
lines  is  partitioned  equally  into  two  modules  with  an  I/O  bus  interconnecting  the  two 
modules.  Each  module  is  self  contained  and  can  completely  service  six  traffic  lines  and 
the  operator  terminal.  The  six  traffic  lines  are  further  subdivided  electrically  on  the 
module  into  three  primary  and  three  secondary  lines.  Under  software  control  the  three 
primary  lines  can  override  the  connections  on  the  secondary  lines.  In  this  manner, 
losing  a  module  would  only  reduce  the  number  of  serviceable  lines  to  the  minimum 
required  by  the  switch,  and  secondarily  each  subscriber  would  be  provided  at  least  one 
back  up  line. 
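
The mission success rule (at least six of twelve access lines remain serviceable) and the two-module partition can be checked with a small model. The names and the representation of a line as a (module, index) pair are assumptions for illustration; the primary/secondary override is not modeled.

```python
REQUIRED_LINES = 6      # mission success threshold stated in the text
MODULES = (0, 1)        # two I/O modules
LINES_PER_MODULE = 6    # three primary plus three secondary lines each


def serviceable_lines(failed_modules=(), failed_lines=()):
    """Count access lines still usable; lines are (module, index) pairs."""
    failed_modules = set(failed_modules)
    failed_lines = set(failed_lines)
    return sum(
        1
        for m in MODULES
        for i in range(LINES_PER_MODULE)
        if m not in failed_modules and (m, i) not in failed_lines
    )


def mission_ok(failed_modules=(), failed_lines=()):
    """True while the switch can still maintain the required six lines."""
    return serviceable_lines(failed_modules, failed_lines) >= REQUIRED_LINES
```

Losing one whole module leaves exactly six lines, the minimum; any further line failure on the surviving module then ends the mission, which is why the partition is split evenly.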

Processor 

The  first  step  is  to  determine  if  there  is  a  failure  in  the  processor.  This  requires 
a method to test both the instruction sequence and the data manipulations. In
either case there are two ways of detecting and one of correcting the failure. In the
first method the error is detected by executing a predetermined sequence of instructions
(diagnostic) at a prearranged time with a standard data word, then comparing or
reporting the results of the operation with that of another processor. The second
method for detecting an error involves the use of internal status signals and flags to
continuously monitor the performance of the processor. When a failure is detected, the
processor, under software control, goes into the diagnostic sequence. When the failure
is determined to be permanent and the processor is on-line, it is replaced
with a spare and service is resumed.
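
The first detection method can be sketched as a fixed instruction sequence applied to a standard data word, with the result compared against a reference processor. Everything specific here is hypothetical: the three-operation sequence, the data word, the function names, and the rule that a failure repeated on every retry is treated as permanent. The text does not say how permanence is decided.

```python
def good_alu(op, a, b):
    """Reference 8-bit arithmetic/logic behavior used for comparison."""
    return {"add": (a + b) & 0xFF, "xor": a ^ b, "and": a & b}[op]


def run_diagnostic(alu, word=0xA5):
    """Hypothetical fixed instruction sequence on a standard data word."""
    r = alu("add", word, 1)
    r = alu("xor", r, 0xFF)
    r = alu("and", r, 0x0F)
    return r


def classify(alu, reference_alu, retries=3):
    """Compare repeated diagnostic runs against the reference processor."""
    expected = run_diagnostic(reference_alu)
    failures = sum(run_diagnostic(alu) != expected for _ in range(retries))
    if failures == 0:
        return "ok"
    # Assumed rule: failing every retry means a hard (permanent) fault.
    return "permanent" if failures == retries else "intermittent"
```

A "permanent" result would correspond to the case where the on-line processor is replaced by a spare.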

Memory 

Programs, in-transit calculations and message information are stored in memory.
Memory  is  exercised  at  least  once  in  every  instruction  cycle  of  the  computer.  Memory  is 
probably  the  most  active  module  in  a  computer  system  and  also,  unfortunately,  the  least 
reliable.  Any  error  in  memory,  whether  internal  (memory  cell),  external  (interface), 
intermittent,  or  permanent,  will  adversely  affect  the  operation  of  the  computer.  Any 
improvements  that  can  be  made  will  significantly  upgrade  the  reliability  of  the  switch. 
Analysis of the types of faults encountered indicates that two methods can be used to
provide fail safe operation. The first is a memory reconfiguration approach, while the
second is error detecting and correcting encoding and decoding.

Memory essentially is comprised of three major elements: address decoding, storage
and  access  lines.  Any  one  of  these  three  elements  represents  a  single  point  of  failure 
in  memory.  In  the  reconfiguration  approach,  primary  and  backup  means  are  provided  for 
addressing  memory.  The  primary  method  consists  of  a  physical  hardware  address  where  the 
addressing  information  is  contained  in  the  identification  number  on  the  fault  tolerant 
bus. In contrast to this, the back up method is defined by a software driven logical
address.  Under  this  scheme,  switches,  under  software  control,  configure  the  memory 
address  of  the  module.  This  enables  the  system  to  disable  a  faulty  memory  section  and 
patch  around  it.  When  a  section  fails  the  primary  address  is  disabled  and  a  spare 
section  is  activated  with  the  address  of  the  faulty  section. 
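
The patch-around step can be sketched as a table that maps each logical address range to the physical section currently serving it. The class and attribute names are hypothetical, and sections are identified by small integers purely for illustration.

```python
class MemoryMap:
    """Logical address -> physical section, with spare sections held back."""

    def __init__(self, n_sections, n_spares):
        # Primary scheme: each section answers to its own hardware address.
        self.section_for_address = {a: a for a in range(n_sections)}
        self.spares = list(range(n_sections, n_sections + n_spares))
        self.disabled = set()

    def patch_around(self, faulty_address):
        """Disable the faulty section and give its address to a spare.

        Returns the spare section now serving the address, or None if
        no spare is left (the map is then left unchanged).
        """
        if not self.spares:
            return None
        self.disabled.add(self.section_for_address[faulty_address])
        spare = self.spares.pop(0)
        self.section_for_address[faulty_address] = spare
        return spare
```

Software using the logical address is unaffected by the swap, which is the point of the back up addressing method.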

The  second  method  is  a  real  time  fault  detection  and  correction  implemented 
completely  by  hardware.  A  modified  Hamming  error  detecting  and  correcting  technique 
compensates  for  errors  in  memory  and  failures  in  the  data  or  address  lines.  The  parity 
encoding  is  done  at  the  data  source  and  the  decoding  is  at  the  data  sink.  In  this 
manner  the  spare  data  and  address  lines  on  the  fault  tolerant  bus  provide  redundant  data 
paths  to  compensate  for  line  failures,  while  spare  memory  cells  compensate  for  storage 
failures.
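
The paper does not specify its modified Hamming code. As a stand-in, the textbook Hamming(7,4) single-error-correcting code below shows the principle of encoding at the data source and decoding at the data sink; the function names and the bit ordering are choices made for this sketch.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits.

    Codeword positions 1..7 hold p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3     # syndrome gives the 1-based faulty bit
    if pos:
        c[pos - 1] ^= 1            # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], pos
```

A single flipped bit, whether caused by a storage cell or by one failed line between source and sink, is located by the syndrome and corrected. Two simultaneous errors are not handled by this basic code.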

Component  techniques 

The  requirement  for  long  uninterrupted  periods  of  reliable  operation  can  be  met  by  a 
complementary non-redundant approach. Here, the strategy is to screen and eliminate
manufacturing  defects  before  they  enter  the  system.  Since  total  elimination  of  these 
defects  is  not  possible,  in  practice  the  goal  is  then  to  reduce  the  number  to  an 
acceptable  value.     Some  examples  are: 

1.  The  most  reliable  components  are  selected  for  the  system; 

2.  Very  Large  Scale  Integration  (VLSI)  is  designed  to  provide  redundant  circuitry 
within  the  component; 

3.  The  system  package  is  designed  to  eliminate  external  interferences; 

4.  The  system  is  allowed  sufficient  burn  in  time  to  eliminate  infant  mortality 
failures.

Summary 

The  object  of  the  fault  tolerant  program  was  to  demonstrate  improved  operational 
survivability  while  minimizing  increases  in  a  system's  physical  dimensions.  Although 
there  are  a  multitude  of  methods  available  to  satisfy  the  projected  military  requirement 
in  the  area  of  survivability,  very  few  are  applicable  in  Marine  Corps  systems  because  of 
the additional hardware required to develop the capability. In the functional approach,
redundancy is evenly distributed throughout the system's hierarchy instead of focusing
it at a particular level. What makes the functional approach appealing in a
distributive processing system is that it takes advantage of the characteristic of the
microcomputers within the system to share loads. This property provides the system with an
inherent form of redundancy. The system described in this paper is presently in the
debugging stage. A summary of the additional hardware and software support needed to
satisfy the survivability problem of the Marine Corps by the functional approach is listed
in  Table  2.  Approximately  35%  additional  firmware  (hardware/software)  was  built  into 
the  system  to  achieve  a  10  to  1  improvement  in  survivability.  This  is  much  less  than 
the  additional  systems  required  by  the  duplication  and  majority  voting  methods.  Upon 
completion  of  the  survivability  testing,  additional  work  is  needed  in  the  area  of 
operational  and  reliability  modeling  along  with  continuing  the  ongoing  performance  test 
to  satisfactorily  complete  this  task. 

TABLE 2

SYSTEM DEVELOPMENT NEEDS AND HARDWARE ADDITIONS

DEVELOPMENT NEEDS
HARDWARE ADDITIONS

DISTRIBUTIVE NETWORK CONTROL ALGORITHM
PROCESSOR TO FUNCTION AS A SYSTEM CONTROLLER
EMBEDDED MAINTENANCE SUPPORT SOFTWARE
FAULT TOLERANT BUS (CONNECTOR)
SYSTEM SUPPORT SOFTWARE (CONTROLLER)
SWITCH CIRCUITRY BUILT INTO VLSI COMPONENTS
MEMORY CIRCUITRY FOR FAULT DETECTION AND CORRECTION
SUPPORT CIRCUITRY BUILT INTO VLSI COMPONENTS
MEMORY BITS TO ACCOMMODATE EXTRA PARITY BITS
MEMORY BITS TO ACCOMMODATE MEMORY PATCH
REDUNDANT CIRCUITRY BUILT INTO VLSI ARCHITECTURE
FAULT TOLERANT BUS INTERFACE CIRCUITRY
COMPATIBLE INTEGRATED CIRCUIT TECHNOLOGY
I/O BUS INTERFACE CIRCUITRY
MEMORY SWITCH CIRCUITRY

Specialized logic to accomplish image processing has been available since the early
1960s. Systems like CELLSCAN, GLOPR, and diff3 have the capability to perform,
respectively, at least 10^, 10^ and 10^ picture point or pixel operations per second using a
local  organization  of  inputs  to  each  gate.     This  kind  of  image  processing  system  has  been 
known  as  cellular  logic.     The  term  goes  back  to  the  early  days  of  computers  through  the 
work  of  von  Neumann    [1,2]   and  Moore    [3]   on  automata;   a  recent  survey  paper  co-authored  by 
one of us [4] discusses cellular logic and applications in medical image processing.
Neighborhood processing is a similar term used to describe a system with pipelining added to
conserve the number of gates needed; see [5].

Costs of digital circuits are rapidly declining, rendering practical systems employing
cellular logic and other types of parallelism; such systems are well-suited to image and
sensor-array signal analysis. Three recent systems, CLIP1 and DAP, produced in the United
Kingdom, and MPP, being built in the U.S.A. (Goodyear Aerospace, Ohio), show use of these
principles and attain, respectively, rates of 10^, 10^10, and 10^11 pixel operations per
second. 

Very  large  scale  integration   (VLSI)    technology  is  well-suited  to  implementation  of  the 
image  processing  logic,   shift  registers,   and  cellular  algorithms  present  in  the  systems 
described  above.     Use  of  such  technology  for  detection  of  tracks  from  a  sequence  of  images 
is  the  focus  of  a  study  we  are  conducting  at  The  Aerospace  Corporation. 

In particular, this paper concerns applying three-dimensional cellular logic to track
detection. In the following sections we describe the nature of cellular computers, relating
them to the diverse set of digital devices available today. Then in succession we present
the key concepts of three-dimensional cellular logic and the track detection experiments
we have conducted. The conclusion then provides recommendations regarding this effort.

Cellular  computers 

The term cellular automaton was coined by von Neumann to designate a situation where
each processing element in an array is connected to its neighbors and where state change
occurs depending on the nearby values. The commercial devices recently marketed possess
this property and also have full computational power at each node. Thus the above-mentioned
DAP and MPP are true cellular computers. The concept of a cellular logical processor
differs from either of these in two respects. First, only binary operations take place at
each node, hence the term "logical". Second, data entry is via a general purpose digital
computer: the array of processing elements acts analogously to an FFT (fast Fourier
transformer) or other specialized peripheral device. The class of cellular logic processors
includes the CLIP machines, characterized by being array systems (CLIP4 is a 96 x 96 array);
the Environmental Research Institute of Michigan (ERIM) Cytocomputer, using pipelining
to cycle an image through fewer processing elements; and GLOPR, a subarray system based on
the notion of subfield, elaborated below and extended to three dimensions here. The acronym
"GLOPR" stands for Golay Logical Processor [6]: the key notion introduced by Golay was that
a tiling of the plane by edge-adjacent hexagons is beneficial due to the absence of the two
different kinds of neighbors, edge and corner, found in arrays of squares.

The hexagonal decomposition of the plane (tessellation) is well-suited to cellular logic.
Because all hexagon cell neighbors are edge-adjacent, simpler algorithms that preserve
connectivity can be devised. Subfields are used to divide the cells in the hexagonal
tessellation into three disjoint sets: elements in each do not touch other members of the
same set. Hence three clock pulses, each causing a change in only one third of the cells,
can accomplish a connectivity-preserving algorithm.
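The three-subfield property can be checked with a short sketch. The axial (q, r) coordinate convention and the rule `(q - r) mod 3` are assumptions chosen for illustration; the paper does not say how GLOPR numbers its cells or assigns subfields.

```python
# Axial (q, r) coordinates for a hexagonal tessellation. This convention is an
# assumption made for the sketch, not something taken from the paper.
HEX_NEIGHBORS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def subfield(q, r):
    """Return the subfield index (0, 1 or 2) of hexagonal cell (q, r)."""
    return (q - r) % 3


def neighbors_share_subfield(size):
    """Return True if any cell in a size x size patch touches its own subfield."""
    for q in range(size):
        for r in range(size):
            for dq, dr in HEX_NEIGHBORS:
                if subfield(q, r) == subfield(q + dq, r + dr):
                    return True
    return False
```

Under this assignment no two edge-adjacent hexagons fall in the same subfield, so updating one subfield per clock pulse never changes two touching cells at once.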

Golay's fourteen "surrounds" are used in developing a logical transform. These distinct
patterns of zeros and ones can be used to implement global "shrinking" algorithms that
reduce shapes to single points (residues) that persist through subsequent stages. The
methods for adapting these procedures that were developed by Golay were recently shown
by one of us [7] to be inferior to an orientation-independent variant. Finally, the
extension of this variant into three dimensions, i.e., the partition of space into volume
zones involving regular solids, becomes the basis of the track-detection algorithm we have
been studying.

Three-dimensional  cellular  logic 

The neighbor cells in a three-dimensional hexahedral space tessellation are: 1) six
elements in the same plane (time) as the central one; 2) three elements in a prior plane;
and 3) three elements in a subsequent plane. All twelve neighbors form a tetradecahedron
(Figure 1): this entity with its central element is called the kernel. These twelve
neighboring cells and the central one are numbered two ways. The sequential numbering shows
how they are counted in adjacent time planes; i.e., 1 through 3, then 4 through 10, and
last 11 through 13. Within the mid-plane, cell 7 is the center of the three lines bounded
by (4, 10), (5, 9) and (6, 8) respectively; it is also a central cell in the solid with
extremes in prior and subsequent planes; i.e., lines with (1, 13), (2, 12), and (3, 11)
as end points. The second numbering scheme shows decomposition of the kernel into six
subfields.
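A twelve-neighbor kernel with the 3 / 6 / 3 split described above can be modeled on a face-centered cubic lattice, whose nearest neighbors form a fourteen-faced solid. The sketch below rests on that assumption: the integer coordinates, the choice of the (1, 1, 1) direction as the time axis, and the function names are illustrative and are not taken from the paper or from TRIAKIS.

```python
from itertools import permutations


def fcc_neighbor_offsets():
    """Return the 12 nearest-neighbor offsets of a face-centered cubic lattice."""
    offsets = set()
    for a in (1, -1):
        for b in (1, -1):
            # All distinct orderings of (a, b, 0); the set removes duplicates.
            offsets.update(permutations((a, b, 0)))
    return sorted(offsets)


def plane_of(offset):
    """Stacking-plane index of an offset: -1 prior, 0 same, +1 subsequent."""
    return sum(offset) // 2


OFFSETS = fcc_neighbor_offsets()
BY_PLANE = {p: [o for o in OFFSETS if plane_of(o) == p] for p in (-1, 0, 1)}
```

In this model every neighbor has an opposite partner through the central cell, which corresponds to the end-point pairs such as (1, 13) and (4, 10) in the sequential numbering.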

In the plane subfield decomposition with either hexagons or squares in the tessellation,
choosing the minimal number of non-adjacent cells leads to either three (hexagons) or four
(squares) distinct subfields. In three dimensions, an analogous process takes place. The
numbers shown in Figure 1 that are 1 through 6 give a six-subfield partition of the
thirteen Figure 1 elements. In general, subfields are used whenever an algorithm performs
analysis which is dependent on the connectivity relationship between cells. The portion of
the track detection algorithm that reduces connected chains of cells containing binary 1's
to residues requires the use of subfields. Another example of use of subfields in planar
systems is for skeletonizing (medial axis transform) algorithms.

Skeletonization is the operation that yields an interior line structure whose elements
are equidistant from at least two distinct boundary points. Using cellular logic to
implement skeletonization is both parallel and omnidirectional and hence highly efficient.
The original work on skeletonization had the purpose of reducing planar shape
representations to linear structures. Recently this was extended to three-dimensional
shapes (8).

Track  detection 

Target  tracks  are  inherently  three-dimensional  entities.     Nevertheless,  many  prior 
procedures  to  detect  tracks  examined  planes  in  x-y  coordinates,   and  after  limiting  the 
potential  track  points  to  relatively  few  locations,   combined  the  data  from  successive 
planes  obtained  in  a  time-sequence.     Although  this  procedure  has  been  successful  in  many 
situations,   it  is  of  limited  value  today. 

First,   availability  of  low-cost  digital  devices  changes  the  computing  economics 
radically:   it  is  now  possible  to  gain  important  detection  capabilities  by  building 
spatial  processors  in  place  of  planar  arrays.     Second,  detection  is  of  greater  time-urgency 
in  a  ballistic  missile  environment  than  in  a  propeller-powered-airplane  air-defense  mode. 
Finally,   data  sources  and  platforms  are  in  use  which  give  large-volume  low-quality  data, 
causing  a  strong  need  for  high  speed  processing.     This  can  be  attained  with  three-dimensional 
cellular  logic. 

Problem  explanation.     An  array  of  infrared  photodiodes  generates  a  spatial  pattern  of 
electrical  signals  which  evolves  over  time.     How  can  this  data,  which  is  inherently  very 
noisy  and  of  low  quality,   best  be  used  to  detect  targets,   locate  the  tracks  they  make,  and 
resolve  crossing  or  nearby  tracks? 

In our experiments simulated IR data was used that was generated by D. McAllister at
The Aerospace Corporation: analog values, i.e., real numbers, were obtained. This data
came from CDC 7600 FORTRAN programs. An array 32 x 32 with 230 time values was the output
of these programs. After expansion by replication to 64 x 64 (in x and y coordinates) and
retention of only the first 64 time slots of the 230, we obtained the baseline target
data. (Replication was necessary to interface with programs available at the University
of Pittsburgh Biomedical Image Processing Unit to perform three-dimensional logical
transformation. A Perkin-Elmer 3230 computer there, with a FORTRAN program package called
"TRIAKIS" developed at Carnegie-Mellon, was used. Note that the displays below, showing
points as 2 x 2 arrays of spheres, were required by this replication.) See Figure 2.
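The data preparation step can be sketched as follows. This is a minimal illustration that assumes "replication" means repeating each sample twice along x and twice along y (consistent with the 2 x 2 display of points); the function names are hypothetical and nested Python lists stand in for the FORTRAN arrays.

```python
def replicate_2x(frame):
    """Expand an n x n frame to 2n x 2n by repeating every sample 2 x 2 times."""
    out = []
    for row in frame:
        wide = [value for value in row for _ in range(2)]  # repeat along x
        out.append(wide)
        out.append(list(wide))                             # repeat along y
    return out


def baseline(frames, keep=64):
    """Keep the first `keep` time slots and replicate each retained frame."""
    return [replicate_2x(frame) for frame in frames[:keep]]
```

Applied to 230 frames of 32 x 32 samples, `baseline` yields a 64 x 64 x 64 array in t-y-x order.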

Algorithms. The overview of the information flow (and the cellular logic program) for the
algorithm to detect tracks using the three-dimensional logical transform is given by
Figure 3. The input data there is an array of size 64 x 64 x 64 in x-y-t. All the points
in the array are processed simultaneously by the algorithm. This was done by the TRIAKIS
algorithm at the University of Pittsburgh, although these programs are now running and
available at The Aerospace Corporation, through the work of D. Conti. Note that TRIAKIS
emulates a cellular computer that would do simultaneous computing on a 64 x 64 x 64 data
array. The actual operation of the algorithm that locates tracks is discussed in the
remainder of this section.

To  use  the  logical  transform,   the  analog  input  data  is  first  thresholded.     The  threshold 
is  chosen  sufficiently  high  that  chains  of  connected  ones  are  probable  in  the  vicinity  of 
target  points.     These  constitute  the  shapes  that  are  skeletonized.     At  high  threshold  values 
chains  are  improbable  in  pure  noise  regions.     By  first  skeletonizing  the  chains  and  then 
reducing  the  results  to  single  points  the  three-dimensional  image  processing  computer 
generates the track residues. (The reverse operation to reducing or shrinking is called
augmenting.)

IR targets in noise were detected by skeletonization of the connected regions in the
three-dimensional spatio-temporal domain using three-dimensional logical transforms. The
actual workspace used, a 64 x 64 x 64 array, simulates the target history over 64 time
slots for a 4096 diode array. (The binary or logical version of this array is called the
field.) The symbol Ξ indicates the number of ones that are present without considering
connectivity. Skeletonization used the algorithm of (7), where it was found that high
values of Ξ must be avoided in order to prevent the formation of rings; i.e., closed
figures which cannot be skeletonized. (Empirically (7), rings are likely to form for
values of Ξ greater than 6 or 7.)

If a target track occurs in white Gaussian noise, it will be retained for Ξ = 6
provided six or more surrounding ones (from noise) occur. A calculation using the binomial
coefficient (the combination of 12 things taken 6 at a time), with the probability of a
binary 1 varying according to the Gaussian distribution, approximates the probability of
six or more surrounding ones. This varies most rapidly when the probability of binary 1 is
a few percent. The likelihood of a one solely due to noise was set to 0.03 by choice of the
threshold. The results are shown in Figure 2. This data was operated upon by the
skeletonizing algorithm. The crossing number for ones, X_1, is set to 4 and the crossing
number for zeros, X_0, is set to the "don't care" value of 9. (If a neighborhood of a cell
has n groups of one neighbors, the crossing number for ones is 2n. Hence X_1 > 2 indicates
that the center element connects two (or more) groups of ones. X_0, the crossing number for
zeros, which is independent of the number of ones in the three-dimensional case, is twice
the number of connected groups of zeros.)
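The size of the noise-only probability can be estimated with a short calculation. The sketch below assumes zero-mean, unit-variance Gaussian noise and statistically independent cells; both are simplifying assumptions, and the function names are illustrative. It sums the full binomial tail rather than only the k = 6 term.

```python
from math import comb
from statistics import NormalDist


def threshold_for(p_one, sigma=1.0):
    """Threshold that zero-mean Gaussian noise exceeds with probability p_one."""
    return NormalDist(0.0, sigma).inv_cdf(1.0 - p_one)


def prob_at_least(k_min, n, p):
    """Probability that at least k_min of n independent cells are ones."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


P_NOISE = 0.03                               # likelihood of a noise-only one
THRESHOLD = threshold_for(P_NOISE)           # in units of the noise sigma
P_SIX_OF_TWELVE = prob_at_least(6, 12, P_NOISE)
```

With a 3 percent chance of a noise-only one, six or more ones among twelve neighbors is a rare event under these assumptions, which is why chains are improbable in pure noise regions.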

Labeling  the  three-dimensional  array  from  the  residues  obtained  is  accomplished  by  taking 
the  binary  complement  of  the  field  of  residues,   augmenting  each  residue  into  a  kernel, 
again  taking  the  complement,  and,   finally  ANDing  with  the  original  logical  array.  The 
operation  is  continued  until  no  further  increase  in  the  labeled  region  occurs.     This  is 
shown  in  Figure  3. 
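The labeling loop amounts to growing each residue by one kernel per iteration while masking with the original logical array. The sketch below is one reading of that procedure: it applies the growth step directly instead of through the complement operations, it represents the field as a set of coordinates of cells that hold a one, and it assumes a twelve-neighbor kernel modeled on a face-centered cubic lattice. None of these choices comes from the TRIAKIS package.

```python
from itertools import permutations

# Assumed 12-neighbor kernel: nearest neighbors of a face-centered cubic lattice.
KERNEL = sorted(
    {p for a in (1, -1) for b in (1, -1) for p in permutations((a, b, 0))}
)


def augment(cells):
    """Grow every cell into its full kernel (the cell plus its 12 neighbors)."""
    grown = set(cells)
    for x, y, z in cells:
        grown.update((x + dx, y + dy, z + dz) for dx, dy, dz in KERNEL)
    return grown


def label_from_residues(residues, field):
    """Region-grow from the residues without leaving the ones of `field`."""
    field = set(field)
    labeled = set(residues) & field
    while True:
        grown = augment(labeled) & field   # AND with the original logical array
        if grown == labeled:               # no further increase: done
            return labeled
        labeled = grown
```

For example, a residue seeded inside a connected chain recovers the whole chain and leaves disconnected noise points unlabeled.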

The computing time is approximately the same number of iterations as required by the
original skeletonization. "Labeling" refers to region-growing using the residue as a seed.
The result is partitioning the original field into nontrack and distinct-track domains.
The details of residue-finding, the core of the procedure, follow.

In each cycle six clock pulses cause the logic to generate skeletons and reduce them to
residues. Cells in the six subfields are activated in the sequence 1-6-3-4-2-5. If a cell
is "zero", its value is left unchanged. If it is "one" and six or more cells in its 12-cell
neighborhood are "one", or if there are two or more contiguous groups of neighboring cells
whose values are one, then, at the termination of the clock pulse corresponding to the
subfield to which it is assigned, its value is still one. Otherwise, it is set to zero.
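The retention rule for a single cell can be sketched as below. This is a literal reading of the rule as stated, under two assumptions that are not confirmed by the paper: the twelve neighbors are modeled as nearest neighbors on a face-centered cubic lattice, and two neighbors belong to the same contiguous group when they are themselves nearest neighbors. Subfield sequencing and any special handling of isolated residues are not modeled.

```python
from itertools import permutations

# Assumed 12-neighbor kernel: nearest neighbors of a face-centered cubic lattice.
KERNEL = sorted(
    {p for a in (1, -1) for b in (1, -1) for p in permutations((a, b, 0))}
)


def count_groups(ones):
    """Count contiguous groups among the neighbor offsets that hold a one."""
    remaining = set(ones)
    groups = 0
    while remaining:
        groups += 1
        stack = [remaining.pop()]
        while stack:                       # flood-fill one group
            x, y, z = stack.pop()
            for dx, dy, dz in KERNEL:
                nxt = (x + dx, y + dy, z + dz)
                if nxt in remaining:
                    remaining.remove(nxt)
                    stack.append(nxt)
    return groups


def next_value(center, ones):
    """Value of a cell after its subfield's clock pulse.

    `ones` is the set of kernel offsets whose cells currently hold a one.
    """
    if center == 0:
        return 0                           # zeros are left unchanged
    if len(ones) >= 6 or count_groups(ones) >= 2:
        return 1                           # dense interior or connecting cell
    return 0
```

In this model the crossing number for ones is `2 * count_groups(ones)`, so a cell with two opposite neighbors set is kept as a connecting element, while a chain end with a single neighbor is removed.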

In Figure 3, CYC refers to the number of iterations, FAC is Ξ minus one, and CNUM1 and
CNUM0 represent X_1 and X_0 respectively. When MODE there is set to 8 (see key in Figure 3),
subfields are used in the computation (six clock pulses per iteration); when MODE is
zero only one clock pulse is used per iteration. The binary value of BORD in Figure 3
indicates the output value of extreme border points in the field after each iteration.

Computational  experiments.     A  small  set  of  the  64,   32  x  32  simulated  IR  data  arrays  was 
actually  employed  in  the  experiments  that  detected  a  track.     Figure  4  shows  this  as  a 
three-dimensional  window  over  17  time  slots  of  the  64.     The  subarray  selected  was  arbitrary: 
to  the  eye  its  contents   seem  noisy  and  it  is  not  clear  that  it  contains  a  track.  However, 
the  three-dimensional  transform  locates  the  track  and  shows  it  extending  over  8  of  the  17 
time  slots.     This  is  shown  in  Figure  5. 

Conclusion 

Cellular logical transformation extends to three dimensions, where it is of value in
locating tracks in noisy data arrays. Cost and speed advantages accrue to this type of
computing device for the track-detection function. Additional experiments should be
conducted to demonstrate whether track-resolving/identifying capability is possessed by
such devices for multi-track environments. Resolution and sizing (cost, practicality)
considerations should be addressed for realistic applications. Finally, the regular
decomposition (quadtrees, pyramids) data structure (9-12) and cellular architectures (13,14)
should be used in a realistic track-detection processor design.
Figure 1 Space Decomposition

A Three-Dimensional Window with Original Target Regions

Track Found By Three-Dimensional Logic Transform

Distributed data flow signal processors

Abstract

Near term advances in technology such as VHSIC promise revolutionary progress in
programmable signal processor capabilities. However, meeting projected signal processing
requirements for radar, sonar and other high throughput systems requires effective
multiprocessor networks. This paper describes a distributed signal processor architecture
currently in development at Texas Instruments that is designed to meet these high
throughput, multi-mode system requirements. The approach supports multiple, functionally
specialized, autonomous nodes (processors) interconnected via a flexible, high speed
communication network. A common task scheduling mechanism based upon "data flow" concepts
provides an efficient high level programming and simulation mechanism. The Ada syntax
compatible task level programming and simulation software support tools are also described.

Background/problem

During recent contractual work (VHSIC, etc.) as well as Texas Instruments' on-going
strategic planning process, the need for a new signal processing architectural approach
was identified. It was recognized that sensor-based processing systems face operational
requirements for acquisition of high resolution data to aid in target imaging, detection,
classification, identification, prioritization, and tracking. Automation of these
operations and correlation to other sensor data provides increased efficiency of data
extraction and reduces operator load.

Current systems with these operational requirements face serious performance
limitations, particularly in the presence of adverse natural (weather) and manmade (ECM)
environments. A chief contributor to these operational deficiencies is the lack of
versatile, programmable, high-throughput signal processors. [1]

Current processor architectural approaches do not support the need for a
resource-efficient multisensor/multimode signal processor whose performance capability, in
terms of memory and throughput, can be tailored to meet a broad spectrum of system
applications. Also, the method for programming these processors does not permit rapid
development of application software which can be easily modified during the system's life.

In early 1981, a Texas Instruments IR&D program was initiated to develop and
demonstrate a processor architecture which could overcome these limitations. [2] The
following is a preliminary description of a system called the TI-Data Flow Signal
Processor (DFSP), which is the subject of this effort.

DFSP  approach/rationale 

Initially, a number of architectural features were recognized as required and/or
desirable in supporting future systems.

1) flexibility

   - capacity scalable (memory, throughput, bus bandwidth, etc.)
   - support for functionally specialized units
   - programmable in a higher order language (HOL)
   - support for high reliability systems (fault tolerance, graceful degradation)

2) integral/rational maintenance/debug capabilities

3) implementable with today's technology

4) compatible with future technology (VHSIC)

5) maximum design commonality within a system and between different system
   implementations.

The generic approach selected for the DFSP, shown in Figure 1, consists of multiple,
functionally specialized, autonomous processors (referred to as nodes) interconnected
via a high-speed bus network. This multi-processor approach provides an inherent
capability to support the requirements stated above.

Traditionally the major difficulty with multi-processor systems has been in the area
of task partitioning/scheduling. Multi-processor systems tend to have "bottlenecks"
which limit the effectiveness of networking multiple resources. Complex software that is
difficult to write and maintain is usually required to control such a system. To
overcome this problem within the DFSP, a "Data Flow" control approach is implemented by a
dedicated task sequencer within each node. As will be described later, this "Data Flow"
based task level scheduling mechanism, which is programmed in a higher order language (Ada
subset), formalizes the typical system real time operating system scheduling constructs.


As shown in Figure 1, each DFSP node consists of two major sections: the functional
processor, and the data flow sequencer and interface (DFSI).

Figure 1. DFSP generic block diagram


Functional  processor.     The  functional  processor  is  the  portion  of  the  node  that  is 
specialized  to  perform  various  forms  of  data  "processing".     By  standardizing  the  interface 
to  this  functional  processor  and  limiting  its  function  to  that  of  a  task  executor  rather 
than  a  task  scheduler,  new  node  types  and  technology  upgrades  may  be  made  simply  and 
with  a  minimum  of  software  changes.     Currently,   the  following  functional  processor  types 
are  under  development: 

-  Vector  processor  -  high  throughput  signal  processing  type  processor 

-  Bulk  memory  -  large  capacity  global  memory  resource 

-  Input/output  -   interface  between  outside  world  and  DFSP 

-  Scalar  processor  -  general  purpose  computer,   provides  high  level  system  control 

DFSI. The DFSI is duplicated in each node and provides common communication, task
scheduling, and maintenance monitoring. The communication network interface (CNI)
provides a message based communication capability via multiple, potentially redundant,
communication buses. The data flow sequencer portion of the DFSI serves as the overall
node controller. Using an associated control memory containing downloaded task level
control information, it performs I/O control, data routing, and task scheduling. The
monitor function provides the scalar processor and/or external maintenance systems with
access to the node local bus. Under program control the monitor selects required data and
routes it as programmed. In addition, the monitor has the capability to form and forward
a pseudo-randomly generated "signature" of the selected data sets. This capability
supports a signature analysis based built-in-test philosophy.

The interaction and control structures of the various portions of the DFSP node are
illustrated in Figure 2. Contained within the control memory associated with each DFS
are a series of descriptors that contain parameters describing the processor(s), tasks
(blocks), inputs (operands) and outputs (releases) which are assigned to each node.
Contained within the local memory of the processor is object code that defines the
processing to be done by each task (block) as well as input and output buffer areas.

Figure 2. DFSP control structure

What data flow is


As mentioned earlier, task level scheduling in the DFSP is based upon the "data flow"
concept. The concept of "data flow" control has been an ongoing area of research at
Texas Instruments since the mid 70's. The idea originated as a mechanism to perform
instruction level sequencing in highly parallel general purpose computer architectures. [3]
As the concept matured, it became apparent that the idea could also provide a coherent
approach to task level scheduling in distributed processing networks. It is this aspect
of "data flow" that we are using in the DFSP program. The following paragraphs describe
the fundamentals of the data flow concept and illustrate how this powerful concept forms
the basis for efficient multiprocessor software design.

The  cornerstone  of  data  flow  is  the  directed  graph  representation  of  a  problem.  A 
directed  graph,   shown  in  Figure  3,   consists  of  a  group  of  nodes  connected  by  arcs. 
Travel  from  node  to  node  occurs   in  only  one  direction  along  the  arcs   (i.e.,  directed). 


Figure 3. Directed graph example

The  well  known  PERT  chart  used  in  project  planning  is  an  example  of  a  directed  graph. 
In  a  PERT  chart,   the  nodes  are  events,   and  the  arcs  indicate  activities  that  must  be 
completed  before  the  event  can  occur.     PERT  charts  are  a  valuable  tool  because  they 
clarify  the  relationships  between  all  parts  of  a  project:   i.e.,   which  tasks  depend  on 
others,  which  tasks  may  be  done  in  parallel,   and  which  lie  on  the  critical  (longest) 
time  path.   In  a  similar  fashion,   a  directed  graph  representation  of  a  radar  mode  will 
clearly  indicate  how  processing  tasks  depend  on  each  other,  which  may  be  done  in  parallel, 
and  timing  relationships  between  tasks. 
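The critical (longest) time path of such a graph can be found with a simple recursive pass. The sketch below treats tasks as nodes with durations, as in the radar-mode graph, rather than placing activities on the arcs as a PERT chart does. The task names and durations in the usage note are hypothetical.

```python
def critical_path(durations, depends_on):
    """Return (length, path) of the longest-duration chain in an acyclic graph."""
    finish = {}   # earliest finish time of each task
    via = {}      # predecessor that determines each task's start time

    def visit(task):
        if task not in finish:
            start, prev = 0, None
            for dep in depends_on.get(task, ()):
                if visit(dep) > start:
                    start, prev = finish[dep], dep
            finish[task] = start + durations[task]
            via[task] = prev
        return finish[task]

    last = max(durations, key=visit)   # task that finishes latest
    length = finish[last]
    path = []
    while last is not None:
        path.append(last)
        last = via[last]
    return length, path[::-1]
```

For instance, with durations `{"window": 1, "fft": 4, "detect": 2, "cfar": 3}` and both `detect` and `cfar` depending on `fft`, which depends on `window`, the critical path is window, fft, cfar, while `detect` can run in parallel with `cfar`.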

To  execute  a  signal  processing  mode  using  the  DFSP,   the  mode  is  modeled  as  a  directed 
graph.     Each  node  in  the  directed  graph  represents  a  processing  block   (task)   that  will 
be  assigned  to  a  processing  element.     The  arcs  represent  the  inputs   (data)  required  to 
execute  the  block.     A  properly  constructed  directed  graph  indicates  all  of  the  data 
required for the execution of each task. As such, the only conditions required for
enabling the execution of a block in a directed graph are that:

1) All required inputs to the block are available

2) There is a place to put the block outputs

This  procedure  for  executing  a  directed  graph  representation  of  a  mode  is  referred  to 
as  "data  flow  sequencing"  and  is  used  by  the  DFSP  for  task  scheduling. 

The  data  flow  sequencing  requirements  specify  when  a  block  may  be  executed.     They  do 
not  require  that  it  be  executed  at  that  time.     One  of  the  implementation  problems  in 
building  a  data  flow  system  occurs  because  more  processing  blocks  may  be  available  for 
execution  than  there  are  hardware  resources  to  execute  them.     Thus,   the  hardware  must 
provide  a  means  of  keeping  track  of  those  blocks  that  have  been  enabled  and  are  pending 
execution.     It  is  this  inherent  "look  ahead"  capability  that  contributes  to  keeping  data 
flow  control  overhead  to  a  minimum.  Given  a  data  flow  task  scheduler  working  independently 
of  the  processor,   task  scheduling  may  take  place  in  parallel  with  the  data  processing. 

As stated above, there are two conditions to be satisfied prior to enabling a block
to be executed. In the DFSP the first, that all inputs be available, is enforced by the
task scheduler. This is accomplished via a parameter associated with each block known
as the "predecessor count". This counter is initialized with a count of the number of
operands (inputs) required by the block. When a block's operand arrives at the node,
the predecessor count for that block is decremented by one. When the count reaches zero,
execution of the block is enabled. When execution is complete, the count is restored.
The second condition, that there be a place for the block outputs, is enforced through a
combination of the inherent structure of the directed graphs and the concept of
"forwarding on availability".
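The predecessor-count mechanism can be sketched in a few lines. This is an illustrative model only: the class and function names are hypothetical, a single queue stands in for the set of enabled blocks pending execution, and output buffering (the second enabling condition) is not modeled.

```python
from collections import deque


class Block:
    """One task (node of the directed graph) with its predecessor count."""

    def __init__(self, name, n_operands, successors=()):
        self.name = name
        self.n_operands = n_operands
        self.count = n_operands            # predecessor count
        self.successors = list(successors)


def run(blocks, initial_operands):
    """Deliver operands, fire enabled blocks, forward results; return firing order."""
    pending = deque()                      # enabled blocks awaiting a processor
    order = []

    def deliver(name):
        block = blocks[name]
        block.count -= 1                   # one operand has arrived
        if block.count == 0:
            pending.append(block)          # all inputs present: enable

    for name in initial_operands:
        deliver(name)
    while pending:
        block = pending.popleft()
        order.append(block.name)           # stand-in for executing the task
        block.count = block.n_operands     # restore the count after execution
        for succ in block.successors:      # forward a result to each consumer
            deliver(succ)
    return order
```

In a diamond-shaped graph where A feeds B and C, and both feed D, block D is enabled only after both of its operands have arrived, and every count is restored so the graph can run again on the next input.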

"Forwarding on availability" is a new term for an obvious, but little appreciated,
concept. Forwarding on availability simply means that a block sends (or forwards)
copies of its result as soon as it is calculated to every other block in the system
that needs it. This requires that each block have sufficient buffering to hold the
result. The value of forwarding on availability is clear when it is contrasted with
the more common alternative, "fetching on demand". When data is fetched, the unit
needing the data issues a request to the unit containing the data. The requestor is
usually delayed until the data is returned. In a conventional uniprocessor, the
requestor might be a processor and the responder a fast memory, and the problems are
minimal. However, fetching in a multiprocessor environment introduces bus and
communication delay as well as memory interference, which significantly reduce efficiency.

Data  flow  sequencing  as  a  control  mechanism  has  a  number  of  desirable  features. 
First,   it  can  be  distributed  since  the  information  necessary  to  determine  if  a  task 
may  be  performed  is  found  by  examining  only  the  node  of  the  graph  and  its  input  and 
output  arcs.     There  is  no  need  for  a  centralized  controller  to  decide  which  operations 
may  proceed  at  any  time.     This  eliminates  the  potential  controller  bottleneck  in  a 
highly  parallel  system    by  allowing  control  operations  to  be  distributed.  Second, 
rapid  mode  switching/interleaving  is  simple.     Using  Figure  3  as  an  example,    it  can  be 
seen  that  execution  of  an  entire  directed  graph  is  initiated  by  the  arrival  of  the 
input  data.     Thus,    it  is  possible  to  have  several  independent  graphs  resident  in  the 
DFSP  and  to  switch  modes  by  simply  directing  the  input  data  to  the  starting  block  of 
whichever  mode  is  to  be  performed.     Start  up  and  shutdown  of  modes  is  automatic  since 
by  definition  a  graph  always  executes  to  completion. 

Programmability/software support tools

A major goal of the DFSP program is to simplify software development/maintenance
for programmable processors. This is achieved through:

-  Hierarchical control

-  Data flow task scheduling

-  Ada description of an operating mode's directed graph

-  General purpose vector unit macro instruction set

The programming levels shown in Figure 4 illustrate the hierarchical relationship between
the DFSP system elements.


[Figure 4 diagram, columns "Programming language" and "Functions". Scalar processor (990/10): Pascal/Ada; mode control, debug capability. Data flow sequencer (DFS): Ada subset; process sequencing, data control, specify signal processing steps. Vector processor: RTL microcode and macros; signal processing algorithm. Labels between levels: mode information; calls to functional macros.]

Figure 4. Programmability hierarchy

The scalar processor is the master DFSP control element and is responsible for sensor
control, control panel interface, and DFSP mode control. To change operating modes, the
scalar processor causes the DFS control memory within each node to be downloaded with
application software to execute the new mode. This approach permits the system to reconfigure
itself in the event of a node failure, and also reduces the size of the control
memory. On command from the scalar processor, the DFS's act autonomously to receive block
operands, schedule block execution, and forward results to other blocks as specified by
the application software. The functional processor interfaced to the DFS performs the
data processing functions specified by the block. These three levels of programmability
correspond to the natural decomposition of a problem and provide a top down structure to
the application software. This top down structure permits incremental software development
and testing, and reduces the software change ripple-through effects which make many
current systems costly to maintain.

The DFS software is written in a subset of Ada, the DoD standard high level language.
Each Ada subprogram represents a node in the directed graph. The subprograms representing
an entire graph (i.e., mode) are collected together into an Ada package. Ada compiler
directives (pragmas) have been added to specify node type, input operand count, and node
replication (fissioning). Node processing commands (e.g., FFT) are specified as Ada
subprogram references.

A set of macro instructions to define signal processing algorithms is microprogrammed
into the vector processor unit. The chosen macro set was developed by Raytheon Company's
Submarine Signal Division for the Naval Air Development Center and is summarized in the
report "Multisensor Standard Macro Function Study", NADC 78188-50. These macros permit
algorithms to be defined in tens of macro instructions rather than hundreds of
microprogrammed instructions. The broad nature of this study ensures that a flexible set of
macros is developed, thereby minimizing any future application dependent microprogramming
effort. In addition, this approach standardizes the DFS/vector processor interface:
future VHSIC upgrades of the vector processor hardware will only require reimplementation
of the macro set while preserving the application software, making VHSIC insertion
cost effective.

Software  support  tools 

A major portion of DFSP development lies in the creation of a set of software support
tools for efficient application program development and test. The support tools
under development may be divided into three categories: processor support tools, data
flow support tools, and system debug/performance monitoring tools.

The  processor  support  tools  include  the  compilers  and  instruction  level  simulators 
associated  with  each  of  the  processor  types.     In  general,   these  tools  have  previously  been 
developed  at  TI  and  are  being  adapted  for  DFSP  use.     Of  particular  interest  are  the 
tools  that  support  the  vector  processor.     These  include  a  sophisticated  register  transfer 
language   (RTL)  microcode  compiler,   and  an  efficient  RTL  simulator. 

DFS  support  tools  are  being  developed  to  allow  effective  utilization  of  the  DFSP. 
These  tools  will  support  task  level  programming  and  simulation  of  the  DFSP.  Four 
major  tools  are  being  developed: 

- Model file generator - This program will process statements that describe
  the system configuration and create a model file for use by other tools.
  Typical inputs include the number and types of nodes, communication network
  structure, and "instruction" execution times.

- Block compiler - This program compiles Ada syntax compatible block programs
  into a linker compatible object format. Each block program specifies the
  inputs and outputs from the block as well as the processing steps which make
  up the block (task).

- Linker - The linker combines all of the block programs for a given mode into
  loadable object modules for each node. It performs task to logical node
  assignment, memory management, and object creation. Linker output may be
  either DFSP executable load modules or input to the simulator.

- Simulator - The simulator is an event driven task level simulator which will
  provide detailed information regarding processor loading, communication
  bus loading, and data flow graph validation.

The system debug/performance monitoring system consists of software resident
in the DFSP Support System Computer that will provide the user with access to data and
timing information within the DFSP. Typical functions callable via this system include
real time timeline monitoring, real time data monitoring, break point setup, and post mortem
system state dump.

DFSP  breadboard  design 

In order to effectively demonstrate the DFSP, a breadboard system is being built.
Figure 5 is a block diagram of the demonstration system with its associated support
equipment.

Figure 5. DFSP breadboard block diagram

Major characteristics of the breadboard system include:

Communication network
  - 5 buses, 4 system + 1 maintenance
  - 8 MHz, 16 bit buses

Vector processors
  - 3 nodes
  - 180 MOPS total throughput
  - LSI based programmable processor

Bulk memory
  - 10 Mbits
  - Programmable

Scalar processor emulated by 990 support computer

I/O node
  - high speed radar data input
  - display data output

Maintenance interface to recorder and 990 support computer

Pragmatics
  - Single rack
  - Conventional packaging (DIP in board)
  - Expandable to 16 nodes with no additional racks

Summary/future plans


The  Data  Flow  Signal  Processor  (DFSP)  may  be  characterized  as  a  federated  multi-node 
architecture  with  a  generalized  interconnect  network,   and  a  centralized  bulk  memory 
which  supports  multiple  instruction  multiple  data  (MIMD)  streams. 

The  architecture  is  being  developed  to  support  future  high  throughput  programmable 
signal  processor  applications.     Significant  emphasis  has  been  placed  upon  developing  an 
architecture  and  software  support  tools  that  permit  effective  utilization  of  the  strengths 
inherent  in  multi-processor  distributed  systems. 

Future plans are predicated upon completion of a breadboard DFSP system in the near
future. A number of current and anticipated study contracts will explore the effectiveness
of the architecture in various applications. Upon completion of the breadboard
system, design and construction of a prototype system is planned. The prototype system
will use a limited number of custom gate array chips and advanced chip carrier packaging
techniques. Successful completion of the DFSP effort is expected to result in a very
competitive pre-VHSIC signal processor system with future growth capability through
VHSIC technology insertion.

Abstract

There  is  a  realization  that  the  future  of  high  speed  computing  will  see  many  great 
changes,  both  in  the  hardware  systems  to  which  people  are  presently  accustomed,  as  well  as 
in  the  software  and  algorithmic  approaches  currently  employed.     The  1980's  will  present 
the  utmost  challenges  in  meeting  the  burgeoning  computational  demands  of  high  speed 
computer  users.     Recent  advancements  made  in  the  field  of  multiprocessing  offer  strong 
promise  in  meeting  these  expanding  requirements.     The  Advanced  Flexible  Processor 
represents  a  quite  mature  implementation  of  one  of  those  "new"  multiprocessor  systems 
finding  major  application  to  signal  processing  and  data  handling  problems. 

Introduction

As the computational performance requirements for many signal processing applications
continue to increase, new methods of achieving the computational power required to meet
these demands are constantly being sought. Technological advances at the circuit level
have not kept pace with the demands for greater and greater computational performance.
The application of multiprocessor technology to these compute intensive signal processing
problems is seen ever more clearly as the answer to meeting the increasing demands for
faster and faster computational systems. Processing tasks considered impossible by
conventional computing methods are realistically achievable through the utilization of
multiple processor approaches.

Multiprocessor systems that have been developed to date have been designed to address
problems of a characteristically singular nature: problems that required either very little
or no interaction between processing modules, or problems that lent themselves to
regular array processing. The Advanced Flexible Processor is a parallel/pipelined
multiprocessor designed to provide a highly flexible, adaptable, and easily expandable
architecture to address a broad class of signal processing and data handling problems.

The AFP is a very powerful, ultra high speed (800 MOP) machine which can function as a
processing element within systems of up to 256 AFP's. The AFP has already been built and
system deliveries have already begun. The AFP is thus significantly ahead of its time and
as such represents a unique and valuable computational tool to be applied to high speed
real and near real time processing problems.

The Information Sciences Division (ISD) of Control Data began the development of advanced
multiple processor systems in 1972. The primary motivation for considering
multiprocessor computational approaches as early as 1972 was the overwhelming
computational requirements confronted when attempting the digital computation of certain
image processing algorithms such as change detection. It was at that time that Control
Data began the development of the Flexible Processor (FP). The FP system was a
multiprocessor designed to address the demanding computational requirements imposed by
certain image processing problems.

The Flexible Processor, which is still being applied to high speed computational
problems today, is a programmable, special-purpose computer employing a highly parallel
architecture. The Flexible Processor, like its successor the Advanced Flexible Processor,
was designed to operate as an individual programmable processing element in an array of
other individually programmable elements. The Flexible Processor used a global bus
interconnection system between processors. Later investigations sought other
interconnection network architectures which might prove to be better suited to a
Multiple Instruction, Multiple Data Stream (MIMD) type of array architecture [1, 2, 3]. This
initial research into various interconnection schemes resulted in ISD
developing a ring connected architectural approach to linking Flexible Processors in large
multiprocessing arrays.

In 1976 Control Data delivered a modular change detection system to Wright-Patterson
Air Force Base which consisted of forty Flexible Processors configured in a ring.
However, research indicated that a processor capable of performing at computational rates
10 times that of the Flexible Processor would be required to meet the burgeoning
computational demands of the 1980's [4-7]. Thus, Control Data Corporation began the
development of the Advanced Flexible Processor using the latest LSI technology which was
developed by CDC for use in its most advanced Cyber computers.

Since  1972,   Control  Data  Corporation  has  been  involved  specifically  with  the 
application  of  multiprocessor  technology  to  a  broad  range  of  problems.     The  design  of  the 
Advanced  Flexible  Processor  reflects  the  experience  gained  over  these  years.     The  exposure 
to  the  special  issues  involved  in  the  direct  application  of  multiprocessor  solutions  to 
real  world  problems  has  forged  the  architectural  design  of  the  AFP.     The  coupling  of  that 
architectural  design  with  the  advanced,   state-of-the-art  circuit  technology  used  in  the 
implementation  of  the  AFP  has  resulted  in  processing  performance  roughly  twenty  times  that 
of  the  Flexible  Processor  and  system  performance  capabilities  far  in  excess  of  any  of 
today's fastest vector processors.

AFP  hardware  overview 


An Advanced Flexible Processor is implemented on four large scale integrated (LSI)
circuit panels. The component technology is emitter coupled logic (ECL) chips. Each LSI
panel carries a total of approximately 500 F200K ECL logic chips and 1,100 ECL 100K logic
chips. The Advanced Flexible Processor employs the same freon cooling system used in
CDC's Cyber 200 series computers. This technology provides an increased reliability
figure at the chip level of approximately 100 times that achievable using ECL 100K logic
chips in an air-cooled environment. The rough computational capability provided by an
array of 16 Advanced Flexible Processors would be approximately 3.2 billion arithmetic and
logical operations per second. A far larger number of operations could be added to the
total if one were to count the many operations associated with operand transfer and data
management which are concurrently performed by the AFP in support of the arithmetic and
logical computations.


Interconnection architecture


AFP systems employ two means to accommodate interprocessor communication: a ring
connected communication system, and a common memory structure through which data can also
be transferred between system processors. The interprocessor communication rate between two
adjacent Advanced Flexible Processors in the communications ring is approximately 800
million bits per second per ring (each AFP can communicate over two rings
simultaneously). Data can be transferred between any system AFP and central memory at a
rate of 1.6 billion bits per second. A unique characteristic of the ring connected
architecture employed by the Advanced Flexible Processor provides a distinct advantage in
the performance capability of AFP multiprocessor systems. Program partitioning
strategies allow one to realize proportional increases in available ring system
intercommunication bandwidth as processors are added to the multiprocessor array. This
feature is in direct contrast to other multiprocessor architectures in which
interprocessor communication is strangled as processors are added to the system. As a
result of this unique feature, an array of 16 Advanced Flexible Processors may provide
overall system bandwidth for intercommunications of 26 billion bits per second.
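
The 26 billion bits per second figure can be reproduced with a short calculation. The sketch below assumes that the program is partitioned so that every adjacent-pair link on both rings carries traffic at the same time; the function names are illustrative, and the shared-bus function is included only as a contrast case, not as a model of any particular machine.

```python
LINK_MBIT_PER_S = 800   # per ring, between adjacent AFPs (from the text)
RINGS_PER_AFP = 2       # each AFP can use two rings at once (from the text)


def aggregate_ring_bandwidth(n_afps, rings=RINGS_PER_AFP, link=LINK_MBIT_PER_S):
    """Total Mbit/s if every adjacent-pair link on every ring is busy at once."""
    return n_afps * rings * link   # a ring of n processors has n links


def shared_bus_bandwidth(n_afps, bus=LINK_MBIT_PER_S):
    """Total Mbit/s for a single shared bus: independent of processor count."""
    return bus


for n in (4, 16, 32):
    print(n, aggregate_ring_bandwidth(n), shared_bus_bandwidth(n))
```

For 16 processors the ring total is 25,600 Mbit/s, which rounds to the 26 billion bits per second quoted above.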


Advanced  Flexible  Processor  architecture 


The  Advanced  Flexible  Processor  is  a  unique  and  powerful  architecture  providing  an 
extremely  high  degree  of  flexibility  and  cost-effectiveness.     It  consists  of  16  relatively 
autonomous  functional  units  interconnected  by  a  power  16  x  18  port,  crossbar 
interconnect.     Each  of  the  data  paths  interconnected  by  the  crossbar  is  16  bits  wide. 
Table  1  describes  the  functional  unit  breakdown  of  the  Advanced  Flexible  Processor.  A 
conceptualized  functional  organization  of  the  AFP  is  shown  in  Figure  1. 

Computations  may  be  streamed  through  the  Advanced  Flexible  Processor  very  efficiently 
due  to  dual  I/O  port  characteristics  of  the  internal  architecture.     Data  elements  may  be 
independently  streamed  in  and  out  of  the  Advanced  Flexible  Processor  through  any  one  or 
all  of  the  four  I/O  channels.     For  example,  data  may  be  streamed  in  through  one  of  the 
memory  I/O  channels,   computations  performed,   and  then  streamed  out  through  one  of  the 
other  three  I/O  channels  simultaneously. 

Table 1. Functional unit breakdown

Number of   Type of Functional Unit        Number of
Units                                      Pipelined Segments

    2       External Memory Access Unit            1
    2       Ring Port I/O Units                    1
    1       Control Unit                           2
    2       Adders Unit                            2
    1       Multiplier Unit                        3
    2       Shift Boolean/Logic Unit               2
    4       2K Data Memory Units                   2
    2       8 Word File Registers                  2

Figure 1. Functional organization concept of the Advanced Flexible Processor

Multifunctional parallelism

The internal architecture of each Advanced Flexible Processor allows multiple
computational streams to be constructed and executed in parallel. By way of example, one
might imagine the multiply unit receiving two operands: one from a memory I/O port and the
other from one of the data memories. Simultaneously, in the same machine cycle, one of the
adders may be receiving a product computed by the multiplier on a previous machine cycle
along with a data element accessed from one of the remaining three data memories as inputs
to an addition operation. The remaining adder may simultaneously be using the sum
produced on a previous machine cycle along with a result from the shift boolean unit to
perform a subtraction. The capability for parallel autonomous execution over the range of
arithmetic, shift-boolean, and data manipulation operators results in achieving extremely
high computational rates for each AFP in the system.

Internal  pipelined  processing 

Each of the internal functional units of the AFP is I/O buffered to its respective
crossbar ports as shown in Figure 2. Each functional unit is equipped with input latch
registers,   buffering  the  crossbar  inputs,   and  output  latch  registers,  buffering  the 
functional  unit  outputs  to  the  crossbar.  This  design  allows  the  intermediate  storage  of 
variables  between  the  functional  units  and  thus  allows  the  functional  units  of  the  AFP  to 
be  pipelined  together  with  the  maximum  flexibility.     Single  or  multiple  pipelined  chains 
are  easily  supported  through  the  crossbar  as  a  result  of  this  method  of  "direct  data 
hand-off"  between  the  functional  units. 

Figure 2. Register level organization of a generic AFP functional unit

Advanced Flexible Processor performance

The machine cycle time of the Advanced Flexible Processor is 21.7 nanoseconds. Every
functional unit can provide results every 21.7 nanoseconds. Thus roughly 50 million
16-bit multiplies, 200 million 16-bit data memory references, 100 million boolean shift
operations, and 100 million 16-bit adds or subtracts, etc. can be performed every second.
The maximum operational speed of the Advanced Flexible Processor, therefore, is 250
million 16-bit arithmetic operations per second.
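
As a rough check of these rates, the sketch below multiplies the unit counts of Table 1 by the clock rate. Two assumptions are made here and are not stated in the text: that the quoted figures use a clock rounded up to 50 MHz (the exact value for a 21.7 nanosecond cycle is about 46.1 MHz), and that the 250 million figure counts the five arithmetic and logic units (one multiplier, two adders, two shift/boolean units) at one result per cycle each.

```python
CYCLE_NS = 21.7                        # machine cycle time from the text
exact_hz = 1e9 / CYCLE_NS              # about 46.1 million cycles per second
NOMINAL_HZ = 50_000_000                # assumed rounded clock behind the quoted figures

# Units per AFP from Table 1; one result per unit per cycle.
UNITS = {"multiplier": 1, "adder": 2, "shift/boolean": 2, "data memory": 4}


def rates(clock_hz):
    """Results per second for each unit type at the given clock rate."""
    return {name: count * clock_hz for name, count in UNITS.items()}


nominal = rates(NOMINAL_HZ)
exact = rates(exact_hz)

# Assumption: the peak figure counts only the arithmetic and logic units.
ARITHMETIC = ("multiplier", "adder", "shift/boolean")
nominal_peak = sum(nominal[u] for u in ARITHMETIC)
exact_peak = sum(exact[u] for u in ARITHMETIC)

for name in UNITS:
    print(f"{name:14s} nominal {nominal[name] / 1e6:6.1f} M/s   exact {exact[name] / 1e6:6.1f} M/s")
print(f"peak           nominal {nominal_peak / 1e6:6.1f} M/s   exact {exact_peak / 1e6:6.1f} M/s")
```

With the rounded clock the per-unit rates come out at 50, 100, 100, and 200 million per second and the peak at 250 million, matching the figures in the text; with the exact cycle time each value is about 8 percent lower.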

AFP  I/O  performance 

The  ring  port  I/O  unit  provides  the  interface  for  each  Advanced  Flexible  Processor  to 
the  ring  interconnect  system.     Two  ring  ports  are  provided  to  each  Advanced  Flexible 
Processor  and  thus  the  capability  for  dual-ring  interconnection  systems  exists.     The  ring 
port  I/O  unit  handles  all  of  the  data  management,   synchronization,  and  protocol  required 
to  communicate  on  the  ring  system  without  interrupting  the  arithmetic  processing  of  the 
Advanced  Flexible  Processor. 

The  external  memory  access  units  provide  the  interface  between  the  AFP  and  the  central, 
high-performance,   random  access  memory  store.     Each  external  memory  access  unit  can 
provide  peak  data  I/O  rates  of  3.2  billion  bits  per  second  and  sustained  I/O  rates  of  800 
million  bits  per  second.     Thus,   the  total  sustained  capability  of  an  Advanced  Flexible 
Processor  from  the  two  ring  port  I/O  units  and  the  two  external  memory  access  units  is  3.2 
billion  bits  per  second. 
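
The sustained total follows directly from the four I/O units. The sketch below also computes a peak total under the assumption, not stated in this section, that the ring ports have no burst rate above 800 million bits per second; the variable names are illustrative.

```python
# All rates in Mbit/s, taken from the text.
RING_PORT_RATE = 800          # per ring port
XMAU_SUSTAINED = 800          # per external memory access unit, sustained
XMAU_PEAK = 3200              # per external memory access unit, peak

N_RING_PORTS = 2
N_XMAU = 2

sustained_total = N_RING_PORTS * RING_PORT_RATE + N_XMAU * XMAU_SUSTAINED
# Assumption: ring ports peak at their quoted 800 Mbit/s rate.
peak_total = N_RING_PORTS * RING_PORT_RATE + N_XMAU * XMAU_PEAK

print(sustained_total, peak_total)
```

The sustained total is 3,200 Mbit/s, as stated above, and the peak total under this assumption is 8,000 Mbit/s, which agrees with the peak bandwidth entry in Table 2.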

AFP  computational  performance 

The  multiply  unit  of  the  Advanced  Flexible  Processor  provides  the  capability  to  produce 
two  16-bit  products  or  one  32-bit  product  every  21.7  nanosecond  machine  cycle.  The 
multiplier  also  provides  the  capability  to  do  population  and  significant  counts.     The  two 
adders  provide  the  capability  of  performing  four  8-bit  adds,   two  16-bit  adds,  or  one 
32-bit add every 21.7 nanosecond machine cycle. The shift boolean units allow barrel
shifts of up to 15 bits to be performed every 21.7 nanosecond machine cycle and are capable of
performing all of the 16 basic boolean logic functions. Each data memory allows the
reading  or  writing  of  one  16-bit  word  every  machine  cycle.     The  file  memories  allow  the 
reading  and  writing  of  four  16-bit  words  every  machine  cycle.     The  control  unit  manages 
program  execution  and  handles  branching  and  accessing  of  programming  instructions. 

The  individual  program  memory  within  the  control  unit  of  each  AFP  consists  of 
1,024  program  instructions.     Each  program  instruction  is  200  bits  wide  and  provides  the 
capability  of  issuing  39  instruction  parcels  every  21.7  nanoseconds.     The  control 
bandwidth  of  the  AFP  is  thus  very  high,  and  allows  a  flexibility  in  control  for  the  easy 
management  and  execution  of  the  16  functional  units  along  with  crossbar  reconfiguration  on 
a  machine  cycle  basis.     As  a  result,   the  Advanced  Flexible  Processor  is  capable  of 
performing  roughly  100  million,   250  million,   or  500  million  arithmetic  and  logic 
operations every second in the 32-bit, 16-bit, or 8-bit modes of operation respectively.

Comparison testing

Latching registers as shown in Figure 2 are also provided within 13 of the functional
units for the storage of input comparand values. Arithmetic computations and testing of
resultant outputs can thus be concurrently performed within these functional units. The
current conditional status of these functional units can be provided to the control unit
every machine cycle for branch decision processing. These comparison results are thus
available to the controller from these functional units at a rate of 650 million per
second. These results are produced totally in parallel with the arithmetic and file
management operations performed within the AFP as previously described.
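
The control-path figures quoted for instruction issue and comparison testing can be cross-checked in the same way as the arithmetic rates. The sketch assumes that the quoted values use a clock rounded to 50 MHz rather than the exact 21.7 nanosecond cycle; the variable names are illustrative.

```python
CYCLE_NS = 21.7
exact_hz = 1e9 / CYCLE_NS               # about 46.1 MHz
NOMINAL_HZ = 50_000_000                 # assumed rounded clock

PARCELS_PER_INSTRUCTION = 39            # parcels issued per machine cycle
PROGRAM_INSTRUCTIONS = 1024             # instructions in program memory
INSTRUCTION_BITS = 200
COMPARE_UNITS = 13                      # functional units with comparand latches

parcels_per_second_exact = PARCELS_PER_INSTRUCTION * exact_hz
parcels_per_second_nominal = PARCELS_PER_INSTRUCTION * NOMINAL_HZ
parcels_in_program_memory = PROGRAM_INSTRUCTIONS * PARCELS_PER_INSTRUCTION
program_memory_bytes = PROGRAM_INSTRUCTIONS * INSTRUCTION_BITS // 8

compares_per_second_exact = COMPARE_UNITS * exact_hz
compares_per_second_nominal = COMPARE_UNITS * NOMINAL_HZ

print(f"parcels/s: {parcels_per_second_exact:.3g} exact, {parcels_per_second_nominal:.3g} nominal")
print(f"compares/s: {compares_per_second_exact:.3g} exact, {compares_per_second_nominal:.3g} nominal")
print(parcels_in_program_memory, program_memory_bytes)
```

At the rounded clock, 13 units give exactly the 650 million comparison results per second quoted above and 39 parcels per cycle give 1.95 billion parcels per second; a full program memory holds 39,936 parcels in 25,600 bytes.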

On  average,  a  typical  computational  process  can  keep  four  of  the  arithmetic  functional 
units  plus  several  memory  and  I/O  units  busy  concurrently,   allowing  a  single  AFP  to 
achieve  an  average  computational  rate  of  about  200  to  250  million  16-bit  arithmetic  and 
logical operations per second. Examples of specific application related performance
figures  are  discussed  following  the  AFP  systems  architecture  section. 

The features provided by a single AFP are summarized in Table 2. The very modular
construction of AFP systems and of the AFP itself allows for very cost-effective system
implementation.

Table  2.     Single  AFP  features 


FEATURE                                   ADVANTAGE

250 million arithmetic computations       Expandable compute power to match
per second for each AFP                   application

Functionally designated intermediate      Allows uninterrupted computation
operand registers                         streaming, eliminating register
                                          reservation hiccups

Direct data hand-off between 16           Provides broadest capability for
functional units through crossbar         multiple chaining with no requirements
switch                                    on operand interdependence

Data fan out of 1:16 on all               Eliminates operand contention,
functional units                          allowing multiple use of a single
                                          operand in one machine cycle

Four independent data memories            Provides 8 KB of circulation vector
providing concurrent access and           storage, avoiding costly vector
combined capability to supply 16          length start-up times
input requests simultaneously

Four independent I/O ports providing      Eliminates vector length hiccups
simultaneous read/write access to HPR     in computation stream
memory

Peak bandwidth                            8 billion bits/second

Sustained bandwidth                       3.2 billion bits/second

200 bit wide instruction packet           Instruction issue rate is
                                          39 instruction parcels/cycle or
                                          2 billion instructions per second

Transparent single level interrupt        No special interrupt exchange
exchange management                       software packages required

Instruction cache size of 1024            40 thousand instruction parcels
instruction packets, each 200 bits        per program interval
wide

AFP system architecture


System  arrays  of  Advanced  Flexible  Processors  are  linked  together  and  synchronized  via 
facilities  provided  by  the  ring  port  functional  units.     Data  elements  16  bits  wide,  along 
with  12  bits  of  control  information,   are  passed  between  ring  ports  on  adjacent  AFP's.  The 
control  information  provides  all  of  the  associated  addressing  information  to  define  the 
single  processor  or  subset  of  system  processors  to  which  the  message  is  to  be  sent. 
Information  identifying  the  appropriate  data  register  file  in  which  the  incoming  data 
element  is  to  be  stored  is  also  contained  within  the  control  field  of  the  ring  packet. 

Each data memory is capable of defining 16 independent data files. Designated bits within
the control field provide interprocessor synchronization information as well. Facilities
within the ring port provide the logic capabilities to use these designated bits to
achieve cross file synchronization. These features assure that a processor cannot
begin a computational task until the appropriate single data file or set of data
files which are to be used as operands in the pending computation are stored away in the
processor.
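
The ring packet described above carries a 16-bit data element and 12 bits of control information. The text does not give the bit layout, so the sketch below treats the control field as an opaque 12-bit value and places it above the data bits purely for illustration; the names `pack` and `unpack` are hypothetical.

```python
DATA_BITS = 16       # data element width (from the text)
CONTROL_BITS = 12    # control field width (from the text)

DATA_MASK = (1 << DATA_BITS) - 1
CONTROL_MASK = (1 << CONTROL_BITS) - 1


def pack(data, control):
    """Combine a data element and a control field into one 28-bit ring word.

    The ordering (control above data) is an assumption made for this sketch.
    """
    if not 0 <= data <= DATA_MASK:
        raise ValueError("data element must fit in 16 bits")
    if not 0 <= control <= CONTROL_MASK:
        raise ValueError("control field must fit in 12 bits")
    return (control << DATA_BITS) | data


def unpack(word):
    """Split a 28-bit ring word back into (data, control)."""
    return word & DATA_MASK, (word >> DATA_BITS) & CONTROL_MASK


word = pack(0xBEEF, 0xABC)
print(hex(word), unpack(word))
```

How the 12 control bits are divided among destination addressing, data file selection, and synchronization is not specified in the text, which is why the sketch does not subdivide them.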

These  synchronizing  control  features  also  prevent  another  processor  from  over-writing 
files  within  a  computing  processor.     Input  and  output  FIFO  buffering  provides  elasticity 
in  communication  between  processors  on  the  ring  systems  to  minimize  processor  idle  time. 
Thus,  due  to  the  built  in  capabilities  of  the  ring  port  functional  units,   the  processing 
elements  are  released  from  the  inflexible  lock-step  synchronization  required  of  other 
single  instruction,  multiple  data  stream  (SIMD)  machines  and  multiple  instruction, 
multiple data stream (MIMD) machines. Further, the system allows
multiple elastic pipelines to be created across system AFP's, which function as powerful
processing elements in the dual ring connected architecture.


A  minimum  AFP  system 


The AFP can be employed singly as an attached processor to a general purpose host
computer; at present, interfaces have been developed for the PDP 11/70. The AFP communicates with
the host computer over a ring channel. The host computer is interfaced to the ring network
via a modified ring port (MRP) connection as shown in Figure 3. An AFP operating as an
attached processor in this configuration would significantly enhance system performance of
the host processor by augmenting its computational capability. The ring port interface
units through which rings of AFP's may be interconnected are indicated in Figure 3 by the
abbreviation RP().

Multiprocessor  AFP  systems 

AFP's  can  be  easily  added  to  the  minimum  system  shown  in  Figure  3.     A  typical 
multiprocessor  expansion  is  shown  in  Figure  4.     AFP's  are  interconnected  on  the  host  ring 
with  each  additional  AFP  augmenting  the  computational  capabilities  of  the  system  by  250 
million  arithmetic  operations  per  second.     An  additional  ring  interconnection  channel, 
shown  in  Figure  4,   is  also  provided  for  interprocessor  communication  and  control.     Up  to 
256  Advanced  Flexible  Processors  can  be  supported  on  each  system  ring;   however,  typical 
systems  are  seen  as  being  in  the  range  of  4  to  32  processors. 


[Figure 3 diagram: PDP 11/70 host with disk, tape, and CRT, connected through a modified ring port (MRP) to a single AFP.]

Figure 3. Minimum AFP system configuration

Figure 4. Typical AFP system configuration showing capabilities for expansion

Centralized high performance memory


[text missing] memory access units (XMAU) of the AFP's are managed by the Storage Access Controller
(SAC). Multiple SAC's may be employed as memory requirements are expanded. Each SAC is
[illegible] the AFP array at a sustained rate of [illegible] billion bits per second.

The centralized, high-performance memory store may be expanded from 125 kilobytes to
[illegible] bytes, providing a maximum memory bandwidth of 12.8 billion bytes per second.
The advanced technique of processor intercommunications significantly reduces processor
idle time. Processor idle time is further reduced through a sophisticated hierarchical
approach to mass memory and I/O management, which ensures contiguous data support to the
processing elements and a continuous computational flow. All memory and communication
paths are designed to support extremely high bandwidths.

AFP  system  performance  for  specific  applications 

A number of specific applications for the AFP have been studied at the Information
Sciences Division. The performance of single and multiprocessor systems of AFP's has been
assessed for these applications. The computational performance of the Advanced Flexible
Processor on a representative set of these algorithms is shown in Table 3. Beyond these
areas of investigation there are yet broader applications for the Advanced Flexible
Processor, [illegible] retrieval systems as well as floating point [text missing]


Table 3.  AFP application performance

APPLICATION                       NUMBER      KERNEL RATE                TOTAL TIME
                                  OF AFP'S

COMPLEX FFT, 1024 POINT,          1           80 ns/BUTTERFLY            0.4 msec
16 BIT ACCURACY                   4           20 ns/BUTTERFLY            0.1 msec

GEOLOCATION,                      1           40 ns/PAIR                 0.4 sec
100,000 MESSAGES,                                                        (10 MILLION COMPARES)
100 LOCATIONS OF INTEREST

2-DIMENSIONAL MATRIX              1           20 ns PER MULTIPLY-ADD     6.6 msec
DECONVOLUTION,
(55 X 80) ELEMENTS

MULTISPECTRAL CLASSIFICATION,     16          480 ns/POINT COMPUTATION   8 msec
(128 X 128) POINTS                            (1240 OPERATIONS           (2.58 BILLION
                                              PER POINT)                 OPERATIONS/SEC)

MATRIX INVERSE,                   1           2 usec/POINT               5.0 msec
(50 X 50) POINTS

MATRIX TRANSPOSE,                             10 nsec/POINT              20 usec
(32 X 64) ARRAY
(1024 X 1024) ARRAY               1           20 nsec/POINT              21 msec
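
The total times listed in Table 3 can be cross-checked from the kernel rates and problem
sizes. The following is a minimal sketch of that arithmetic, not part of the original
paper. It assumes a radix-2 FFT with (N/2) log2 N butterflies (the paper does not state
the radix); the dictionary keys and function name are illustrative only.

```python
import math


def fft_butterflies(n: int) -> int:
    """Butterfly count for a radix-2 FFT of n points (radix is an assumption)."""
    return (n // 2) * int(math.log2(n))


# name -> (number of kernel operations, seconds per kernel operation)
rows = {
    "fft_1_afp": (fft_butterflies(1024), 80e-9),
    "fft_4_afp": (fft_butterflies(1024), 20e-9),
    "geolocation": (100_000 * 100, 40e-9),      # messages x locations of interest
    "multispectral": (128 * 128, 480e-9),
    "matrix_inverse": (50 * 50, 2e-6),
    "transpose_32x64": (32 * 64, 10e-9),
    "transpose_1024": (1024 * 1024, 20e-9),
}

# Total time is simply operation count times time per operation.
totals = {name: count * rate for name, (count, rate) in rows.items()}

# Multispectral classification: 1240 operations per point, one point every 480 ns.
ops_per_second = 1240 / 480e-9

for name, seconds in totals.items():
    print(f"{name:16s} {seconds * 1e3:10.4f} msec")
print(f"multispectral rate: {ops_per_second / 1e9:.2f} billion operations/sec")
```

Under these assumptions the computed totals agree with the tabulated values to within
rounding.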
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AFP  software  facilities 


The  programming  of  the  AFP  system  is  presently  done  through  the  use  of  two  very 
powerful  software  development  tools.     The  first  of  these  tools  is  the  AFP  cross  assembler, 
MICA.     The  second  tool  is  the  AFP  instruction  level  simulator,  ECHOS.     The  MICA  cross 
assembler  and  the  ECHOS  instruction  level  simulator  allow  all  programming  to  be  done 
"off-line."     AFP  programs  are  written  in  the  MICA  language  on  a  Control  Data  Cyber  700 
series  computer.     The  edited  files  are  then  processed  by  the  MICA  cross  assembler.  MICA 
checks  for  all  illegal  lexical  and  syntax  usages  as  well  as  illegal  hardware  usages. 
Functional  unit  and  crossbar  usage  conflicts  are  identified  to  the  programmer  through  the 
facilities  of  MICA.     MICA  produces  a  binary  file  of  the  submitted  program  which  runs 
directly  on  the  Advanced  Flexible  Processor. 

The binary file produced by MICA also runs directly on ECHOS, the AFP instruction
simulator. ECHOS provides a register level simulation of the submitted program. ECHOS
interactively executes the program in software precisely the way the program will run in
the Advanced Flexible Processor. ECHOS simulates the execution of a program partitioned
over multiple AFP's as well as the execution of a program running on a single AFP. A
programmer can single step through his program, specifying the printout of all or a
selected set of functional registers in the AFP. The accuracy, power, and detail of the
ECHOS simulator allow a programmer to confidently expect his program to run the very
first time it is run on an AFP. Thus, programming activities can be carried out with no
interruption to useful AFP system data processing.

The programming support offered for the AFP is extensive, making AFP programming simple
and direct. It is nevertheless realized that higher level language programming would
provide even greater utility and a more generally usable interface. Control Data is
currently involved in an effort to develop a FORTRAN compiler for the AFP. The initial
FORTRAN compiler for the AFP will be an optimizing compiler and will use the highly
parallel functional architecture of each AFP efficiently. The development of this HLL
interface will further enhance the programmability of the AFP, broadening the user base of
AFP systems.

Conclusions

The  Advanced  Flexible  Processor  is  a  unique  entry  into  the  multiprocessing  field.  It 
provides  the  dynamic  capabilities  offered  by  an  MIMD  machine  with  advanced  features  such 
as  a  sophisticated  interprocessor  ring  communications  network;   efficient  utilization  of 
the  system  processors  is  therefore  effected.     Within  each  Advanced  Flexible  Processor, 
dynamic multiple chaining can be achieved due to the superior flexibility of the
intra-processor crossbar. Multiple functional units can be executed simultaneously with
each  functional  unit  providing  a  broad  range  of  instruction  defined  operational 
capabilities.     Multiple  comparisons  are  available  within  each  machine  cycle  for 
simultaneous  multiple  condition  sensing. 

Programming  tools  already  developed  for  the  Advanced  Flexible  Processor  system  are  robust; 
however,   in  addition  to  those  tools  already  developed,   efforts  are  currently  underway  to 
develop higher level language tools for the AFP. The development of these tools adapts
state-of-the-art  compiler  techniques  to  obtain  horizontally  optimized  machine  code  within 
each  AFP  processor.     Global  system  efficiency  is  sought  through  the  application  of  data 
flow  program  analysis  techniques  to  achieve  optimized  processor  utilization  across  the 
multiprocessor  system. 
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Abstract 

This  paper  describes  the  S-1  multiprocessor  system.     It   is  composed  of  16  supercomputer 
class  uniprocessors  with  local  caches,  an  extremely  large,  medium  latency  shared  memory, 
and  a  low  latency  synchronization  bus  for  passing  short  messages.     The  system  is  applicable 
to  a  wide  variety  of  applications,   including  large-scale  physical  simulation,  real-time 
command  and  control,  and  program  development   in  a  time-sharing  environment.     The  hardware 
organization,   its  implications,  and  software  supporting  the  efficient  utilization  of  the 
multiprocessor  are  discussed. 

Introduction

The S-1 Project [1] is engaged in the development of advanced digital processing
technology for potential application in the military and scientific communities. Current work
being  sponsored  by  the  U.S.  Navy  and  the  Department  of  Energy  involves  the  design  and 
development  of  extremely  high  performance,  general  purpose  computers   (S-1)   and  multiprocessor 
interconnection  technologies. 

The  reasons  for  development  of  multiprocessors  have  been  widely  discussed;  chief  among 
them  are  reliability,  economy  and  scale.     We  place  heavy  —   though  not  exclusive   —  emphasis 
on  the  issue  of  scale. 

Today, there are a number of important problems for which manual solution is infeasible,
yet which cannot be handled by existing computers because they have insufficient computing power
[2].     As  an  example,  the  ability  to  provide  an  accurate  two  week  weather  forecast  would 
have  extraordinary  economic  leverage.     It  would  allow  farmers  to  select  optimal  times  for 
planting  and  harvesting,  and  provide  substantial  warning  of  natural  disasters  to  minimize 
loss  of  life  and  property.     The  latest  computational  methods  for  weather  prediction  are 
believed  to  be  adequate  for  the  task;  unfortunately,   they  overwhelm  the  computing  and 
storage capacity that is available today. Development of new oil and mineral resources is
of  vital  national   importance.     Much  of  the  exploration  being  conducted  involves  seismic 
data  processing,  and  employs  vast  computer  resources.     Effective  utilization  of  the  new 
semiconductor  technology,   specifically  very  large  scale  integration   (VLSI),   is  limited  by 
our  ability  to  design  and  debug  circuits   involving  hundreds  of  thousands  of  transistors. 
Computer-aided  design  techniques,  such  as  the  Project's  SCALD  system   [3],  have  been 
demonstrated  to  greatly  reduce  development  time  of  new  digital  systems;  however,  their  use 
is  effectively  limited  to  designs  of  moderate  size  because  of  capacity  limitations. 
Similar  limitations  are  seen  in  a  variety  of  military  applications. 

Given  a  particular  logic  technology,  there  is  a  limit  to  performance  that  can  be  obtained 
regardless  of  the  complexity  or  cleverness  of  the  processor  design.     Today's  fastest 
processors  using  commercially  available  components  have  a  peak  performance  in  the  10-40  MIPS 
(million instructions per second) regime for scalar operations and 100-400 MFLOPS (million
floating point operations per second) for vector operations. A multiprocessor, however, can
exceed the inherent limitations on a single processor by performing computations in parallel.

Multiprocessor systems which have been demonstrated to date fall roughly into two
categories. The first includes systems that have a large number of small scale processors
(minicomputers or microcomputers). Examples include most of the early research
multiprocessors such as Cm*. Aggregate system performance is limited because of the limited
performance of the processing elements, and the limited number of processing elements
connected together. The second category encompasses systems that have a small number of
medium scale processors (small mainframes). This approach has been taken in several
commercial offerings that provide cost effective performance enhancement for batch or
time-sharing applications through dual-processor configurations. Aggregate system performance
is not an issue in these systems as they are used to run more jobs rather than a single
job faster.

The S-1 Project is taking the unique approach of assembling a multiprocessor consisting
of up to sixteen uniprocessors, each of which has a performance comparable to that of the
fastest  supercomputers.     This  paper  will  address  four  topics:     design  of  the  uniprocessors, 
the  multiprocessor  architecture,  operating  system  support,  and  the  tools  for  partitioning 
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single  problems  for  a  multiprocessor. 


Uniprocessors 

For  use  in  the  multiprocessor,  we  are  developing  a  family  of  processors  having  similar 
architectures,  but  differing  implementation  technology.     Each  successive  family  member  is 
intended  to  make  maximally  effective  use  of  the  then  available  logic  families.     Such  a 
succession  of  processors  is  required  in  order  to  maintain  the  multiprocessor's  edge. 
Advances  in  semiconductors  are  occurring  at  such  a  rapid  rate  that  a  multiprocessor  tied 
to  one  particular  technology  would  soon  be  made  obsolete  by  single  processors  having  an 
order  of  magnitude  greater  speed. 

The first generation of the S-1 family of processors is the Mark I, which has been
operational  since  1978.     Implemented  in  ECL-10K  medium  scale  integrated  circuits   (MSI),  it 
is  roughly  equivalent  in  processing  power  to  one-third  of  a  CDC  7600.     The  second  generation, 
the  Mark  IIA,   is  currently  undergoing  initial  checkout.     Through  use  of  extensive  hardware 
support  for  vector  and  floating  point  computations,  and  faster  logic   (ECL-100K  MSI),   it  is 
expected  to  achieve  performance  comparable  to  existing  supercomputers  such  as  the  Cray-1. 
Future  generations  are  planned  that  will  follow  the  leading  edge  of  implementation 
technologies  to  obtain  ever  increasing  performance  and  ever  decreasing  cost,  power  and 
space requirements. The S-1 Mark V, targeted for development in 1985, is intended to be a
"supercomputer  on  a  wafer"  with  performance  2-3  times  that  of  the  Mark  IIA. 

Unlike traditional supercomputers which sacrifice functionality for performance, the
architecture of the S-1 uniprocessors has been designed to be easy and efficient to use for
a wide variety of applications. In this, it closely resembles the highly popular
mini-mainframes which stress flexibility over performance.

The  architecture  was  designed  with  a  number  of  goals  in  mind.     First,   it  must  be  suitable 
for high performance implementation; second, it must be simple for a high level language
(e.g. Ada) compiler to make effective use of the instruction set; and third, it must provide a
comprehensive  set  of  data  types  and  operations  so  that  the  programmer  can  select  the 
arithmetic  precision  appropriate  to  a  problem. 

In addition to the usual general purpose features, the S-1 architecture has incorporated
a number of special purpose operations to provide especially high performance for its
anticipated applications. Many scientific codes make heavy use of elementary functions such
as sine, cosine, exponentials, and logarithms. The architecture provides these functions as
single instructions, and the Mark IIA has special hardware to permit the instructions to
execute at about the same speed as a simple multiply. An extensive vector instruction set
is provided to enhance performance on problems that manipulate large arrays of data. Special
vector instructions are provided for signal processing applications. Examples include FFT's
and filtering operations. Matrix operations are also supported, including matrix multiply
and generalized transpose. Because the S-1 implementations are uniformly cache-based, all
vector instructions execute with a one element step size to avoid inefficient use of the
cache. In cases where the problem requires non-unity step sizes, the transpose instruction
can be used to extract the relevant elements into a unity step size temporary vector.
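
The idea of extracting strided elements into a unit-stride temporary before applying a
vector operation can be illustrated in software. The sketch below is only an analogy and
does not reproduce the S-1 transpose instruction; the function names gather_strided and
unit_stride_sum are hypothetical.

```python
from typing import List, Sequence


def gather_strided(memory: Sequence[float], start: int, stride: int,
                   count: int) -> List[float]:
    """Copy `count` elements spaced `stride` apart into a contiguous temporary."""
    return [memory[start + i * stride] for i in range(count)]


def unit_stride_sum(vector: Sequence[float]) -> float:
    """Stand-in for a vector operation that only accepts unit-stride operands."""
    total = 0.0
    for value in vector:
        total += value
    return total


# A 3 x 4 matrix stored row by row; walking down column 1 is a stride-4 access.
rows, cols = 3, 4
matrix = [float(r * cols + c) for r in range(rows) for c in range(cols)]

column_1 = gather_strided(matrix, start=1, stride=cols, count=rows)
column_sum = unit_stride_sum(column_1)
print(column_1, column_sum)
```

The gather touches each strided element once; every later pass over the temporary is then
a sequential, cache-friendly access.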

The S-1 architecture provides the user with a large, segmented virtual address space
spanning  2  billion  9-bit  bytes  of  data.     Memory  capacity  on  this  scale  is  crucial  for  the 
effective  solution  of  large  problems  such  as  three-dimensional  physical  simulations.  The 
large  address  space  allows  all  the  problem  data  to  reside  directly  in  memory  in  the  obvious 
fashion,  and  eliminates  the  programming  contrivances  needed  to  explicitly  manage  multiple 
types  of  computer  system  storage   (i.e.,  manually  swapping  data  to  and  from  a  disk  file). 
A  virtual  memory  mechanism  maps  the  virtual  address  space  to  physical  memory.     In  the  event 
that  the  user's  memory  requirements  exceed  physical  capacity,   it  is  possible  for  the 
operating  system  to  simulate  the  additional  memory  with  a  slight  performance  penalty;  this 
avoids  the  problem  of  a  program  "falling  off  a  memory  cliff".     With  today's  rapidly 
decreasing  costs  of  memory,  however,   it  is  economical  to  purchase  sufficient  memory  to  meet 
the  requirements  of  even  the  largest  programs. 

Multiprocessor  Systems 

The S-1 Multiprocessor System is a MIMD (multiple instruction, multiple data stream)
organization. The multiprocessor currently being built at the Lawrence Livermore National
Laboratory  consists  of  16  Mark  IIA  processors,  connected  together  with  a  crossbar  switch 
as  shown  in  Figure  1. 

A  crossbar  is  the  highest  possible  performance  interconnection  network,  with  a  direct 
logical  connection  from  each  processor  to  each  memory  bank.  Given  that  high  performance 
processing  elements  are  being  used,  the  cost  of  the  crossbar  switch  turns  out  to  only  be 
a  few  percent  of  the  system  cost,  making  it  the  obvious  choice  for  use. 
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To  the  programmer,   the  S-l  Multiprocessor  looks  like  16  identical  processors  executing 
out  of  a  very  large   (up  to  16  billion  bytes)  common  memory.     The  processors  always  get  the 
latest value associated with a memory location, and instructions operate in a
read-modify-write fashion. All of the complexity of moving results between different
processors and between processors and memories is handled completely by the hardware in an
invisible fashion.

In order to speed up effective memory access times, the processor keeps the most-recently
referenced memory locations in cache memories, which are very high-speed local memories
contained inside of the processors. Each processor has two cache memories, one which keeps
track of the most-recently referenced 64K bytes of data, and one which keeps track of the
most-recently referenced 16K bytes of instructions. When a processor wants to use a location
that is contained in one of the caches, the effective access time is zero, since the read is
overlapped with the execution of the previous instruction. When the processor wants to use
a location that is not contained in one of the caches, the processor goes out to the main
memory to see if the desired location is stored there. If so, it reads the 64 bytes around
the location of interest, and stores them in its cache for future use, removing the 64 bytes
that haven't been referenced for the longest period of time. If the location that is
desired is contained in the cache of another processor, the requesting processor will ask
the processor that has the location in its cache to remove it from its cache, and to transmit
the data to the requesting processor.
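
The replacement behavior described above (fetch the 64-byte block on a miss, discard the
block unreferenced for the longest time) can be sketched as a small software model. This
is an illustration only: it assumes true least-recently-used replacement over the whole
cache, which the text suggests but whose hardware organization is not given, and it does
not model transfers from another processor's cache. The class name LRUBlockCache is
hypothetical.

```python
from collections import OrderedDict

BLOCK_BYTES = 64               # block size given in the text
DATA_CACHE_BYTES = 64 * 1024   # data cache capacity given in the text


class LRUBlockCache:
    """Toy model of a cache that replaces the least recently used block."""

    def __init__(self, capacity_bytes=DATA_CACHE_BYTES, block_bytes=BLOCK_BYTES):
        self.block_bytes = block_bytes
        self.max_blocks = capacity_bytes // block_bytes
        self.blocks = OrderedDict()   # block number -> True, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        block = address // self.block_bytes
        if block in self.blocks:
            self.blocks.move_to_end(block)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.blocks) >= self.max_blocks:
            self.blocks.popitem(last=False)     # evict least recently used block
        self.blocks[block] = True
        return False


# A deliberately tiny two-block cache makes the eviction order easy to follow.
cache = LRUBlockCache(capacity_bytes=128)
trace = [0, 8, 64, 0, 128, 64]
results = [cache.access(address) for address in trace]
print(results, cache.hits, cache.misses)
```

In the trace, the access to address 128 evicts the block holding address 64 (the least
recently used at that point), so the final access to 64 misses again.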

The technique by which the hardware automatically keeps track of shared data in a
multiprocessor with caches is called cache coherence [4]. Associated with each block of 64
bytes in main memory are an additional 17 bits that specify the current "ownership" of the
block. There is one bit for each of the 16 processors in the multiprocessor which is set if
the corresponding processor has a copy of the block in its cache, and the 17th bit says that
somebody has a copy of the block for write access. Multiple processors are allowed to have
copies of a block for read access, but only one processor is allowed to have a block for
write access.
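
The 17-bit ownership word lends itself to a compact software model. The sketch below is a
hedged illustration of the bookkeeping described, not the S-1 hardware protocol: the class
and method names are hypothetical, and the choice to invalidate the writer's copy when
another processor requests read access is an assumption based on the earlier description
of a holder removing the block from its cache.

```python
NUM_PROCESSORS = 16
WRITE_BIT = 1 << NUM_PROCESSORS   # the 17th bit: some processor holds write access


class BlockDirectoryEntry:
    """Ownership bits for one 64-byte block of main memory."""

    def __init__(self):
        self.bits = 0

    def holders(self):
        """Processors whose presence bit is set."""
        return [p for p in range(NUM_PROCESSORS) if self.bits & (1 << p)]

    def is_write_owned(self):
        return bool(self.bits & WRITE_BIT)

    def acquire_read(self, proc):
        """Grant a read copy; return the processors that had to drop the block."""
        invalidated = []
        if self.is_write_owned() and self.holders() != [proc]:
            invalidated = self.holders()   # assumed: the writer gives up its copy
            self.bits = 0
        self.bits |= 1 << proc
        return invalidated

    def acquire_write(self, proc):
        """Grant exclusive write access; all other copies are invalidated."""
        invalidated = [p for p in self.holders() if p != proc]
        self.bits = (1 << proc) | WRITE_BIT
        return invalidated


entry = BlockDirectoryEntry()
entry.acquire_read(3)
entry.acquire_read(7)
readers = entry.holders()                 # two simultaneous readers
dropped_for_write = entry.acquire_write(7)
writer = entry.holders()
dropped_for_read = entry.acquire_read(2)
print(readers, dropped_for_write, writer, dropped_for_read, entry.is_write_owned())
```

The invariant the model maintains is the one stated in the text: any number of read
copies may coexist, but a write copy is always the only copy.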

The  use  of  shared  memory  to  provide  high  speed  synchronization  and  low  latency  data 
transmission   (less  than  a  microsecond)   is  difficult.     For  problems  which  require  very  close 
cooperation between the processing elements, a special set of hardware implemented queue
instructions is provided. These instructions allow one processor to put computed results
into  a  queue  for  another  processor,  which  takes  values  and  does  further  computation.  We 
have  found  that  this  can  substantially  speed  up  processing  in  some  algorithms. 
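
The producer/consumer pattern these queue instructions support can be imitated in
software. The sketch below uses Python threads and the standard queue module as a stand-in
for two processors and a hardware queue. It says nothing about the actual S-1 instruction
semantics or latencies, and the two stage computations are arbitrary placeholders.

```python
import queue
import threading

DONE = object()   # sentinel marking the end of the stream


def producer(out_q, values):
    """First stage: compute a partial result and enqueue it."""
    for v in values:
        out_q.put(v * v)
    out_q.put(DONE)


def consumer(in_q, results):
    """Second stage: dequeue values and do further computation."""
    while True:
        item = in_q.get()
        if item is DONE:
            break
        results.append(item + 1)


def run_pipeline(values):
    q = queue.Queue(maxsize=4)   # bounded, so the producer cannot run far ahead
    results = []
    threads = [
        threading.Thread(target=producer, args=(q, values)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


pipeline_output = run_pipeline(range(5))
print(pipeline_output)
```

Because the queue preserves order and there is a single consumer, the results arrive in
the order the producer generated them.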

Operating  System  Support 

In order to provide a workable software base for experimentation, the S-1 Project has
undertaken  the  development  of  a  new  operating  system,  called  Amber,   that  is   intended  to 
provide  a  flexible  interface  to  the  multiprocessing  capabilities  of  the  system. 

The  basic  design  goal  of  Amber  is  to  support  a  widely  varying  community  of  users  — 
including  real-time,  computation  intensive,  and  time-sharing  —  on  one  system.     We  see  this 
as  particularly  important  since  it  allows  for  extensive  sharing  of  effort,  both  in  the 
development  of  system  software  and  applications  software.     Often  several  operating  systems 
are  developed  for  new  computers,  one  for  each  major  class  of  application.     When  this  occurs, 
there  is  little  motivation  to  share  development  effort  between  the  different  operating 
systems.     Facilities  with  common  functions  are  implemented  multiple  times  with  different 
interfaces  for  each  operating  system.     This  not  only  increases  the  total  development  burden, 
but also limits the rate at which the system matures, since a smaller user group is
available to test out the system. On the other hand, when there is a single multi-function
operating system, tools such as compilers, debuggers and file maintenance utilities can be
readily shared between different applications. More important is the fact that libraries
developed by user groups can be shared as well. There is, however, a danger in the
development of a multi-function system: the system may not fulfill any of the requirements
well. To avoid this problem, Amber has a modular layering of functions. The lowest levels
of  the  system  provide  only  atomic  functions  that  can  be  implemented  efficiently;  higher 
levels  of  the  system  incorporate  the  more  complex  functions  which  are  specific  to  particular 
applications.     Commonality  is  therefore  retained,  but  an  application  need  only  invoke  the 
functions  necessary  to  it. 

The  central  example  of  this  kind  of  layering  of  function  exists   in  the  scheduler.  The 
low  level  scheduler  provides  an  efficient  mechanism  for  short-term  scheduling  of  tasks  on  a 
single  processor.     The  basic  algorithm  is  a  simple  priority  scheduling  algorithm  with 
round-robin  queues.     Within  a  single  priority  level,  each  task  may  run  until  it  must  wait 
for  some  external  event,  or  an  assigned  run  quantum  expires,  at  which  time  it  is  moved  to 
the  end  of  the  queue.     Tasks  may  be  assigned  to  different  priority  levels  depending  on  their 
relative  importance  or  real-time  constraints.     For  example,   in  a  time-sharing  system  normal 
user  tasks  might  be  given  priority  over  batch  jobs,  and  relinquish  priority  to  tasks  which 
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must  respond  to  external   interrupts  in  a  certain  length  of  time.     The  high  level  scheduler 
implements higher level, policy oriented scheduling functions, by manipulating the
parameters of the low level scheduler, such as task priority or quantum size. The simplest
example of such a scheduling policy is for "real-time" jobs. Here the policy is simple:
select the priority that the job is to have, and assign it to a processor. More complex
policies  occur  in  batch  or  time-sharing  systems,  where  it  may  be  desirable  to  load-share 
across  all  the  processors  in  the  system  or  to  guarantee  a  particular  job  a  certain  fraction 
of  system  resources.     In  contrast  to  the  low  level  scheduler,  which  makes  assignments  on 
millisecond  timescales,  the  policy  decisions  are  made  on  second  or  minute  timescales  and 
can  therefore  be  relatively  expensive  without  unduly  affecting  system  response. 
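
The low level algorithm described above (strict priority between levels, round-robin with
a run quantum within a level) can be sketched as follows. This is a simplified model, not
Amber's implementation: it assumes a larger number means higher priority, treats all tasks
as ready from the start, and omits waiting on external events. The class name and task
names are hypothetical. A high level policy would act only by choosing the priority and
quantum values passed in.

```python
from collections import deque


class LowLevelScheduler:
    """Priority scheduling with one round-robin queue per priority level."""

    def __init__(self, quantum):
        self.quantum = quantum
        self.queues = {}   # priority -> deque of [task name, remaining run time]

    def add_task(self, name, priority, run_time):
        self.queues.setdefault(priority, deque()).append([name, run_time])

    def run(self):
        """Run every task to completion; return (task, time slice) pairs in order."""
        trace = []
        while any(self.queues.values()):
            # Pick the highest priority level that still has a ready task.
            top = max(p for p, q in self.queues.items() if q)
            task = self.queues[top].popleft()
            ran = min(self.quantum, task[1])
            task[1] -= ran
            trace.append((task[0], ran))
            if task[1] > 0:
                self.queues[top].append(task)   # quantum expired: back of the queue
        return trace


sched = LowLevelScheduler(quantum=2)
sched.add_task("batch", priority=0, run_time=3)
sched.add_task("user_a", priority=1, run_time=3)
sched.add_task("user_b", priority=1, run_time=2)
sched.add_task("interrupt", priority=2, run_time=1)
schedule_trace = sched.run()
print(schedule_trace)
```

The trace shows the interrupt-driven task running first, the two user tasks alternating
within their shared level, and the batch job running only once the higher levels are
empty.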

The  low  level  scheduler  enforces  a  dedicated  processor  assignment  for  each  task  given 
to  it,   rather  than  scheduling  each  task  to  the  next  available  processor.     This  means  that 
processors  may  lie  idle  in  the  system  while  there  are  tasks  ready  to  run.     While  this  may 
seem unfortunate at first glance, there is in fact strong motivation to restrict tasks to
single processors in the short run. First, the I/O architecture of the S-1 attaches
peripherals  through  dual-port  memories  to  a  particular  processor.     A  task  whose  purpose  is 
to  control  a  peripheral  can  only  run  on  the  processor  to  which  the  peripheral  is  attached; 
the  task's  processor  assignment  must  reflect  this  fact.     Second,  the  internal  processor 
caches  are  very  large,  and  as  a  task  runs  it  builds  up  a  substantial  investment   in  data 
that  has  been  locally  cached.     If  its  execution  were  to  be  moved  to  a  different  processor, 
the  data  would  have  to  be  swapped  back  to  main  memory  and  then  swapped  into  the  cache  of 
the  new  processor.     Consequently,   if  tasks  are  moved  from  one  processor  to  another  on  a 
short  time  scale,  a  noticeable  performance  degradation  results.     By  performing  processor 
reassignment  on  a  relatively  long  time  scale,  the  effect  is  trivialized.     Third,  to  support 
parallelism between tasks working on the same problem, it is necessary to ensure concurrency
of execution. By assigning each such task to its own processor, we can do so without use
of a complex algorithm which would interfere with the simple requirements of other
applications.

One of the important features of the S-1 multiprocessor is the large shared memory which
permits  high  bandwidth  communications  between  tasks.     Access  to  the  shared  memory  is  provided 
in  two  ways:     sharing  of  entire  address  spaces,  and  sharing  of  specific  data  objects. 

The address space of a task encompasses all data to which it has access; this includes
program instructions, common blocks, and local variables. In many operating systems, each
task is assigned a unique address space (usually called a core image). Each task sees its
own private copy of all programs and data. Modifications made are not apparent to other
tasks.     In  Amber  it  is  possible  for  two,  or  more,   tasks  to  share  the  same  address  space, 
i.e.,   identically  the  same  physical  storage.     As  a  result,  a  modification  to  data  made  by 
one  task  is  immediately  visible  to  another,  even  if  the  tasks  reside  on  different  processors. 
Such  shared  data  may  be  used  as  semaphores  or  locks  to  synchronize  the  execution  of  the  tasks 
or  as  shared  data  bases  to  be  concurrently  processed  by  the  tasks. 

While there are uses for sharing entire address spaces (for instance, the semantics
of Ada require tasks to execute in the same address space), there is added protection in only
sharing that portion of the address space which is actually common. For this, Amber implements
segmentation. A segment is little more than an ordinary file, except that a task can instruct
that the file be mapped into its address space so that it may be directly modified. When
two tasks both map the same segment, they share a single physical copy, while other portions
of the address space stay private. Thus, a task is protected against inadvertent
modifications to its private state. The segmentation mechanism can provide further intertask
protection as well. The modes of access (read, write) that a task is permitted to a segment
are the same as for normal file protection. It is then possible, for example, to set up,
and enforce, a reader/writer relationship between two tasks by granting one read/write
access on the segment, and the other only read access.
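
A rough present-day analogue of a segment is a memory-mapped file. The sketch below uses
Python's standard mmap module to map one temporary file twice, once read/write and once
read-only, to illustrate the reader/writer relationship described. It is an analogy under
stated assumptions, not Amber's interface: both mappings live in one process rather than
in two tasks, and the function name demo_shared_segment is hypothetical.

```python
import mmap
import os
import tempfile

SEGMENT_BYTES = 16


def demo_shared_segment():
    """Map one file twice and show that the read-only view sees the writer's data."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, bytes(SEGMENT_BYTES))   # give the "segment" its size
        writer = mmap.mmap(fd, SEGMENT_BYTES, access=mmap.ACCESS_WRITE)
        reader = mmap.mmap(fd, SEGMENT_BYTES, access=mmap.ACCESS_READ)

        writer[0:5] = b"hello"               # modify through the read/write view
        seen_by_reader = bytes(reader[0:5])  # the other view shares the same copy

        try:
            reader[0:1] = b"x"               # the read-only view must refuse writes
            write_rejected = False
        except TypeError:
            write_rejected = True

        reader.close()
        writer.close()
        return seen_by_reader, write_rejected
    finally:
        os.close(fd)
        os.remove(path)


seen, rejected = demo_shared_segment()
print(seen, rejected)
```

The access mode is fixed when the mapping is created, which mirrors the point in the text
that the permitted modes follow the ordinary file protection.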

The inherent redundancy of multiprocessors is often used to obtain increased
reliability and availability. Amber uses a facility called dynamic reconfiguration to exploit
the redundant components of the S-1 multiprocessor. At any time, Amber is capable of
providing service with only a partially operating configuration, and it is possible to
dynamically change the configuration without halting the system. If a memory box is to be
removed, data in that box is moved either to another box or to disk and the virtual memory
mapping updated to reflect its new location. If a processor is to be removed, tasks on that
processor are halted and redistributed to other processors for execution. When a memory
or a processor is added, it is added to the pool of system resources and is assigned as
needed.

A  Multiprocessor  Software  Tool 

The  construction  and  maintenance  of  large  application  programs  for  a  multiprocessor 
system  presents  many  problems.     The  details  of  the  multiprocessor  system  may  greatly 
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influence  the  structure  of  the  software  that  yields  the  best  performance.     For  instance, 
the  algorithm  that  is  fastest  on  a  uniprocessor  may  not  exploit  the  capabilities  of  a 
multiprocessor as well as one tailored for parallelism. In addition, the number of
processors available on a time-shared (or gracefully degrading) multiprocessor may vary with
time,  with  different  algorithms  being  appropriate  for  different  load  factors  and  numbers 
of  processors. 

Thus,   it   is  desirable  to  maintain  a  single  source  which  works  well  on  many  configurations. 
However,   including  too  many  details  of  how  to  best  perform  a  task  may  lead  to  unreadable, 
unmodifiable, and untransportable programs. The approach taken in the Paralyzer, a tool
being developed by the S-1 Project, is to split programs into two conceptual pieces. One
part  describes  the  "how"  of  the  computation,  which  includes  the  basic  data  and  control  flow 
of  the  algorithm.     The  second  part   is  the  "where"  of  the  computation.     This  basically 
specifies  what  processors  are  to  run  the  computations,  as  well  as  modifying  the  control  and 
data  flow  of  the  first  section  in  ways  appropriate  to  the  hardware  available.     The  intent 
is  that  the  first  section  is  relatively  machine  and  configuration  independent,  whereas  the 
second  is  completely  driven  by  the  computation  resources. 

The current version of the Paralyzer uses Pascal as the source language, and is implemented
as a source-to-source translator. Special Pascal comments are used to describe some of the
"where". The special comments and the "where" description file are implemented in Maclisp
(a variant of Lisp), so a complete programming language is available for program
manipulation. A library of routines has been written in Maclisp to implement the most common
kinds of transformations.

A simple example of a transformation performed by the Paralyzer involves partitioning
the processing of a matrix among several processors. The directives in the special comments
instruct that the matrix be divided into several equal sections, along columnar boundaries,
and that each section be given to a separate processor for parallel execution. Such a
transformation is feasible when the computation of a single matrix element depends only on
other elements in the same column, not on elements in the same row. This restriction ensures
that a single processor has all the data it needs to perform its part of the computation.
When other dependencies exist in the data, other, more complicated transformations are called for.
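
The columnar partitioning described above can be illustrated with a short sketch. It is written in Python rather than the Pascal and Maclisp used by the Paralyzer, and it is not the Paralyzer's output: the function names `split_columns`, `running_column_sum`, and `partitioned_apply` are hypothetical, and the per-column running sum is only an assumed example of a computation in which each element depends solely on elements of its own column.

```python
from concurrent.futures import ThreadPoolExecutor

def split_columns(n_cols, n_workers):
    """Return contiguous, near-equal (start, stop) column ranges, one per worker."""
    base, extra = divmod(n_cols, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def running_column_sum(matrix, start, stop):
    """Assumed example computation: each result depends only on its own column."""
    rows = len(matrix)
    block = [[0] * (stop - start) for _ in range(rows)]
    for c in range(start, stop):
        total = 0
        for r in range(rows):
            total += matrix[r][c]
            block[r][c - start] = total
    return block

def partitioned_apply(matrix, n_workers):
    """Give each worker one column section, then stitch the sections back together."""
    ranges = split_columns(len(matrix[0]), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        blocks = list(pool.map(lambda rng: running_column_sum(matrix, *rng), ranges))
    return [sum((blk[r] for blk in blocks), []) for r in range(len(matrix))]
```

Because no worker reads a column owned by another, the sections need no communication until the results are reassembled, which is the property the restriction in the text is meant to guarantee.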

In  addition  to  generating  code  for  uniprocessor  and  multiprocessor  systems,  the  Paralyzer 
has  been  used  to  generate  code  for  simulation  purposes.     For  instance,   instead  of  generating 
variable  references,  the  Paralyzer  can  generate  calls  to  routines  that  allow  the  cache  and 
memory  performance  of  the  algorithm  to  be  determined. 

Summary 

The S-1 Multiprocessor System is a new step in the development of high-performance computer
systems. It combines many cost-effective supercomputers with the interconnection hardware
and software needed to utilize those processors effectively on a single problem. The goal is to
provide the capability to solve real problems of interest to real users.
Abstract

A MIMD architecture suitable for the Flow Model Processor (FMP) of the Numerical
Aerodynamic Simulator (NAS) Processing System (NPS) has been described to NASA, a result of
extensive studies and evaluations. The FMP architecture is targeted to support a
throughput in excess of 1000 Mflop/sec over a range of applications. This paper summarizes
the architecture and describes the strategies adopted for making this many-processor
multiprocessor controllable and efficient. The key language components which allow efficient
control of the system by the user, while at the same time supporting straightforward
application definition, will be described.

Background 

In  1977,  work  began  on  a  series  of  studies  and  evaluations  to  determine  whether  the 
design  of  a  "Flow  Model  Processor"  (FMP),  which  could  achieve  a  sustained  computational 
rate of one billion floating point operations per second on complete computational
aerodynamics problems and other applications of interest, was feasible and, if so, what that
design  would  entail.  The  studies,  whose  results  are  partially  reported  here,  were  funded 
in  part  by  NASA  and  in  part  by  Burroughs  Corporation.  These  studies  concluded  that  the  use 
of  a  tightly-coupled  MIMD  (Multiple-Instruction,  Multiple-Data  stream)  architecture  would 
not  only  capture  the  concurrency  utilized  by  the  current  vector  processors,  but  that  such  a 
MIMD  approach  would  be  able  to  utilize  additional  forms  of  concurrency  inherent  in  the 
applications  of  interest.  This  MIMD  architecture  would,  therefore,  open  new  opportunities 
for  research  in  numerical  methods. 

Historically, vector processing has been approached in two ways in order to achieve
higher system throughput: pipelines and arrays. While these computational techniques
have been developing, the problems in aircraft design seem to be getting progressively more
difficult. Physical experiments require the use of wind tunnels (themselves a scarce
resource and very expensive to duplicate). These constraints have motivated the
development of sophisticated computational techniques to reduce the amount of physical modeling
required and to improve the quality of the design studies. The objectives for the FMP
originated from the memory capacity and performance needed to study complex physics and
whole aircraft configurations. As the project evolved, additional applications were
considered.

Because the results of future research in numerical methods cannot be predicted, a
general purpose system is needed in order for it to remain useful over an extended period.
The resulting stated objective is to be able to solve computational aerodynamic problems
which require 40M words of main memory at a sustained computational rate in excess of 1000
Mflops. Goals for some other applications of interest were also set.

Introduction 

Once the NPS is available, designers will have the option of physical or computational
experiments. Figure 1 highlights these approaches and shows the steps which must be
followed in order to perform useful computational experiments. The details of conducting
physical experiments (such as tests in wind tunnels) are not shown in this figure.
The computational facility to be developed will augment such physical experiments. This
facility will be used through a process of abstraction, simulation, and interpretation.
First, the physical model will be abstracted in order to define a corresponding
mathematical model. Then, the simulation capabilities of the system will be used to perform a
computational experiment. Finally, the results of the simulation can be interpreted by the
user to determine the predicted physical conditions. The interpretation process is
interactive and will utilize graphics to display simulation results to the user.

In order to support the initial step, abstraction of the physical model, a set of
extensions to the FORTRAN 77 language has been defined. These extensions are designed to
simplify abstraction of the application into mathematical form and to allow the user to have
efficient control of the system. The resulting model is a program which, when executed,
simulates the system being studied and allows various computational experiments to be
performed. The applications have a large amount of numeric computation, a considerable
amount of inherent concurrency, and simple control structures. From a modeling point of
view, the applications are concerned with the geometries pertinent to the analysis of the
model, the problem state throughout the model, and the definition and control of the
concurrency inherent in the model.

The resulting design of the Flow Model Processor (FMP) is a MIMD (Multiple Instruction,
Multiple Data stream) architecture. Special features have been incorporated so that a
single program can be issued by the compiler. All processors cooperatively execute that
single application for that single user and then proceed to the next job.

The following sections first provide a brief overview of the system. Then, the
important new language constructs will be described. Finally, the approach which allows
effective and efficient control of such a large computational system will be introduced.

System  overview 

In  order  to  understand  the  power  of  the  software  constructs,  the  system  which  evaluates 
them  must  be  understood.  The  discussion  below  summarizes  the  important  characteristics  of 
the  Flow  Model  Processor  portion  of  the  system  from  a  hardware  point  of  view.  Software 
issues  will  be  summarized   in  the  next  section. 

Figure 2 shows a simplified block diagram of the proposed multiprocessor. Each of the
128 processors has its own program counter, its own local memory for data, and its own
independent attachment to the Connection Network. The Connection Network (CN) provides
access to a common Extended Memory (EM) and a staging memory called the Data Base Memory
(DBM). Internally, each processor is highly concurrent, with a number of independently
executing functional units operating on 64-bit data words.


Figure 2. FMP architecture concept. (Block diagram: the Processor Control & Maintenance
Network (PCMN) carries job and control information to the processors; each processor has a
local memory and a CN buffer; the Connection Network (CN) links the processors to the memory
modules of the Extended Memory and to the Data Base Memory, which connects to files.)

The processors, the modules of the EM, and the DBM are interconnected with the
Connection Network (CN), a circuit-switched network which has distributed local routing controls.
Figure 3 shows this network. The CN allows each processor, without regard to any other
processor, to make independent requests for connection to and service by any memory module.
While conflicts are possible (due to collisions in the network and to simultaneous attempts
to make use of the same memory module by different processors), the extensive simulations
performed to date have shown that the expected degradation due to collisions
in the network is not a performance bottleneck.

Figure 3. Connection network with local control of path select. (Diagram labels: PROCESSOR
PORTS; (0111); MEMORY PORTS; where S = 2x2 crossbar switch, locally controlled.)

All  highly  concurrent  problems  require  the  various  program  parts  to  synchronize  from 
time  to  time.  When  this  operation  involves  128  processors,  the  use  of  classical  software 
handshaking  or  the  use  of  read-modify-write  operations  on  memory  locations  can  easily 
become a bottleneck. In this FMP design concept, a special hardware network (Processor
Control and Maintenance Network, PCMN) will accept the "I GOT HERE" signals from the
processors and, when all signals are received, will produce and broadcast a "GO" command back
to all of the processors.
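
The meaning of the "I GOT HERE" / "GO" exchange can be sketched in software, purely to show the synchronization semantics. The sketch below is an assumption-laden illustration in Python: `threading.Barrier` stands in for the PCMN, and the function `run_phases` and its phase log are hypothetical. It does not model the hardware network, whose purpose is precisely to avoid doing this synchronization in software.

```python
import threading

def run_phases(n_processors, n_phases):
    """Run n_phases of work on n_processors threads with a barrier after each phase."""
    go = threading.Barrier(n_processors)   # stands in for the PCMN (assumption)
    log, lock = [], threading.Lock()

    def processor(pid):
        for phase in range(n_phases):
            with lock:
                log.append((phase, pid))   # placeholder for this phase's work
            go.wait()                      # assert "I GOT HERE", resume on "GO"

    threads = [threading.Thread(target=processor, args=(pid,))
               for pid in range(n_processors)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log
```

No thread can begin phase n+1 until every thread has finished phase n, so all log entries for one phase precede all entries for the next.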
Other important capabilities are included in this FMP design concept. They include a
common code memory, fault tolerance via error correcting codes in the memories, graceful
degradation, and system configuration modularity.


Software  concepts 

Many software related issues were considered during the development of the FMP design.
The operating system was simplified through consideration of the FMP as an attached
processor to an existing commercial processor (which already has a mature operating system).


Key language concepts. In order to effectively control the FMP just described and,
perhaps  more  importantly,  in  order  to  ease  the  task  of  describing  models  and  applications 
in  a  form  suitable  for  execution  on  the  FMP,  a  few  simple,  but  powerful,  language  concepts 
were  developed.  These  concepts  have  been  embedded  as  extensions  to  the  ANSI  FORTRAN  77 
language.  The  key  language  features  are  the  DOALL,  the  COALL,  and  input-output  management. 
The  form  of  these  constructs  and  their  usage  will  be  introduced  in  the  following  sections. 


DOALL. The most important language concept is the DOALL. The DOALL is conceptually
very similar to the DO loop in standard FORTRAN, except that the iteration instances of a
DO loop execute sequentially while the iteration instances of a DOALL execute
concurrently (if sufficient processing resources are available). The body of code of a DO
loop is executed first for the initial value of the loop index, then for the next, and so on
(sequentially, each instance in order). In the DOALL, on the other hand, the instances of
the code body can be executed simultaneously by the processors, with each processor
evaluating the code for a different value of the loop index than the others. Because each
processor is executing independently of the others, local evaluation of conditionals can be
processed immediately without coordination with the other processors. The body of code
evaluated in the DOALL can contain DO loops, assignment statements, I/O commands,
conditionals, subroutine calls, GO TO's, and intrinsics. Upon completion of its instance, the
processor checks to see if any other instances need execution (a condition which occurs
when there are more instances to be evaluated than available processors). If there are,
the processor proceeds to the next one. If not, an "I GOT HERE" signal is asserted by the
processor. The processor then waits for the "GO" signal broadcast by the Processor Control
& Maintenance Network (PCMN).


COALL. While the DOALL can be used to describe all inherent concurrency in a model, a
second form was introduced to simplify the description of another form of concurrency.
When a COALL block is established surrounding a sequence of statements or blocks, each of
the statements or blocks will be evaluated by a different processor. This language form is
useful in identifying initialization sequences which are mutually independent and which can
be evaluated concurrently on different processors. The COALL also provides a means for
initiating simultaneous execution of different processes.


Input/Output. In order for the FMP to be useful, it must accept inputs, process them
appropriately and generate outputs. Because of the speed with which the FMP utilizes input
and produces output, the FMP itself must support formatted and direct I/O. A series of
extensions to I/O processing concepts has been defined to allow the FMP to process
formatted I/O as efficiently as other computations. The concept is to allocate portions of the
I/O buffers explicitly to the various instances. In this way, many processors can be
concurrently processing I/O records from different portions of the same I/O buffer. In
addition, a form of "time-stamped" output has been defined so that the physical output will
appear in the correct chronological order of generation, independent of which physical
processors produced the output.
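
The idea of "time-stamped" output can be sketched as follows. This is a hypothetical Python illustration, not the FMP FORTRAN mechanism: it assumes that the stamp is simply the instance index, and the function name `stamped_output` and the record format are invented for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def stamped_output(values, n_workers=4):
    """Format one record per instance concurrently, then emit in stamp order."""
    produced = []                 # records arrive here in completion order
    lock = threading.Lock()

    def work(instance):
        line = f"{instance:04d} {values[instance]:10.3f}"   # formatted record
        with lock:
            produced.append((instance, line))               # stamp = instance index (assumption)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(work, range(len(values))))

    # The physical output order follows the stamps, not the worker that finished first.
    return [line for _, line in sorted(produced)]
```

Whatever order the workers finish in, the emitted records appear in instance order.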


Because  the  DOALL  and  COALL  are  high-level  concurrency  constructs,  parallel  forms  of 
standard  intrinsics  are  not  required.  For  example,  if  a  SIN  intrinsic  is  executed  within 
the  body  of  a  DOALL,  the  DOALL  will  cause  execution  of  the  SIN  intrinsic  in  each  of  the 
instances. The execution of SIN in the various instances may or may not be concurrent,
depending on the flow of execution in each instance. No special parallel SIN intrinsic is
needed. Some special parallel intrinsics are nevertheless defined; SUMALL and MAXALL are
examples.


These  I/O  concepts,  and  the  DOALL  and  COALL  concurrency  constructs  extend  FORTRAN  77  to 
form  a  very  powerful  modeling  support  language  and  allow  rather  direct  control  of  the  MIMD 
architecture  from  the  high-level  language.  The  next  section  explains  how  the  language  maps 
onto  the  hardware  for  this  control  to  exist. 
MIMD structure control


The language constructs just described map onto the FMP's MIMD architecture in such a
way as to simplify control of the FMP and to reduce the complication of its implementation.
The compiler processes the FMP FORTRAN language, identifying the new constructs, DOALL and
COALL. Code generated by the compiler defines the start and completion of these
concurrency structures. In addition, the compiler establishes the necessary pointers to I/O
buffers so that concurrent evaluation of formatted I/O can proceed.

Assignment of specific instances to specific processors cannot be accomplished
effectively by the compiler; those decisions are often data and configuration dependent.
Therefore, the compiler emits code at the beginning and end of DOALL and COALL blocks which,
when executed, will perform the allocation of instances and which will identify the point
at which synchronization should occur. The remaining job of the compiler is
straightforward since all that remains of the compilation task is to compile standard, sequential
FORTRAN statements onto a serial processor. This compilation process can utilize existing
compiler technology effectively.

The allocation of instances to processors is potentially very complicated. If a goal of
minimizing processor wait time were established, a considerable amount of linear program
evaluation would go into scheduling and rescheduling instances. This project has taken a
rather pragmatic, but realistic, approach. The majority of applications were observed to
have the property that the amount of computation in each instance (or group of instances)
is approximately the same. Thus, a simple, non-optimizing, distributed algorithm for
allocating instances was chosen. This algorithm is evaluated on each processor after each
processor "knows" its assigned processor number, the total number of instances, and the
total number of active processors. The algorithm is as follows:


[1] Set Instance Number (IN) = Processor Number.

[2] If IN > Max Number of Instances, then exit by "I GOT HERE" and wait for "GO"; else
continue to [3].

[3] Evaluate code in DOALL or COALL block. (Note that Instance Number will relate directly
to the loop variable value.)

[4] Increment the Instance Number by the Max Number of Processors and return to [2] in
order to check if additional work needs to be performed.
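
Steps [1] to [4] can be sketched in a few lines of Python. This is only an illustration of the allocation rule, not the code the FMP compiler would emit; the function names `instances_for` and `schedule` are hypothetical, and processor and instance numbers are assumed to start at 1.

```python
def instances_for(processor_number, n_processors, n_instances):
    """Instances one processor evaluates under steps [1]-[4] (1-based numbering assumed)."""
    evaluated = []
    instance = processor_number          # step [1]
    while instance <= n_instances:       # step [2]: otherwise exit and wait for "GO"
        evaluated.append(instance)       # step [3]: evaluate the block body here
        instance += n_processors         # step [4]
    return evaluated

def schedule(n_processors, n_instances):
    """Apply the rule on every processor; no central scheduler is involved."""
    return {p: instances_for(p, n_processors, n_instances)
            for p in range(1, n_processors + 1)}
```

For example, `schedule(4, 10)` gives processor 1 the instances 1, 5, 9; processor 2 the instances 2, 6, 10; processor 3 the instances 3, 7; and processor 4 the instances 4, 8. Each processor derives its own list from three numbers it already knows, which is why no inter-processor coordination is needed until the closing synchronization.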


Note that this algorithm supports the desire for graceful degradation of such a
multiprocessor. If a processor fails, a processor renumbering and broadcast of the new Max
Number of Processors is all that is necessary in order for this distributed algorithm to
automatically redistribute the instances. Simulations have shown this simple distributed
algorithm and approach to control of work on the FMP to be effective on the problems of
interest. Other algorithms can be used, simply by changing the code emitted by the
compiler in response to DOALL and COALL statements.
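
A small sketch shows why renumbering suffices after a failure. It is a Python illustration under stated assumptions: the processor names and the function `plan` are hypothetical, surviving processors are assumed to be renumbered 1 to P in sorted order, and the cyclic rule of the text is applied with the new processor count.

```python
def plan(active_processors, n_instances):
    """Renumber the surviving processors 1..P and apply the cyclic allocation rule."""
    p_count = len(active_processors)
    allocation = {}
    for new_number, name in enumerate(sorted(active_processors), start=1):
        # Instances new_number, new_number + P, new_number + 2P, ... up to n_instances.
        allocation[name] = list(range(new_number, n_instances + 1, p_count))
    return allocation

before = plan(["A", "B", "C", "D"], 10)   # all four processors healthy
after = plan(["A", "B", "D"], 10)         # processor "C" has failed
```

After the failure the ten instances are still covered exactly once, now spread over three processors, without any change to the compiled program.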


Conclusion 


A tightly-coupled MIMD architecture for high-throughput numerical computation has been
presented. The introduction of programming constructs which simplify the description of
the inherent, high-level concurrency in models and which allow concurrent formatted I/O
processing should make an FMP based on these concepts widely applicable.
Abstract

This paper presents a mathematical technique for the allocation of software tasks to
the elements of a distributed signal processor. The method exploits the regularity and
periodicity of signal processing software to yield optimal task-to-processor allocations.
Although the method permits the use of various performance criteria, emphasis is placed upon
minimizing the time of completion, or "cycle time," of the signal processing software. The
problem formulation includes inter-task communication, task memory requirements, and time
constraints on task execution. Finally, the paper presents and analyzes a realistic
software allocation problem.

Introduction 

The  design  of  real-time  distributed  signal  processors  is  a  field  ripe  for  the  application 
of  practical  optimization  techniques.     Over  the  past  decade  a  variety  of  such  techniques  has 
been  developed  and  exploited  in  the  context  of  relatively  generic  distributed  computer 
systems. However, there have been few results published on the application of these
techniques to distributed signal processing computers [1].

In general, signal processing requirements are more rigorously structured than the
requirements of other, more generic systems. For example, signal processors execute a series
of predetermined software tasks, each of which operates on a set of well-defined input data.
Furthermore, the signal processing software task structure generally possesses well-defined
linear precedence relationships, as is illustrated in Figure 1a. Such a structure, which
stands in marked contrast to more general precedence relationships as shown in Figure 1b,
represents a relatively simple task scheduling problem. Each task in the set cannot execute
before: a) the task directly before it completes execution; and b) inter-task data transfer
(if any) occurs. Thus, scheduling of the tasks is subject to combinatorial problems of
lesser magnitude than those which so seriously affect the computational complexity of more
intricately structured software [2-6].

The repetitive nature of signal processing computational requirements is yet another area
which may be effectively exploited by optimization techniques. Here, a typical signal
processor is required to execute repeatedly the same software on different sets of data. The
data sets flow into the system and are processed sequentially by the software. Typically,
the goal of the signal processing computer designer is to develop a system which permits
execution of software in as short a "cycle time" as possible. This goal places several
powerful restrictions on the hardware to which the tasks can be assigned. Such limits can
be interpreted as restricting the software-to-hardware allocation matrix and further
decreasing the complexity of the problem. In most signal processing environments the degree
of sparsity of the allocation matrix is quite high, a factor that can be exploited by the
designer.

The goal of this paper is to present a methodology which can be applied to a wide variety
of distributed signal processing design problems into which some degree of sparsity has been
introduced. The technique described is based upon spatial dynamic programming (SDP) and is
capable of accommodating a wide variety of design constraints including: upper limits on
software cycle times, limits on memory capacities, and inter-task communication. The
technique can also be adapted to situations in which the objective is other than the
minimization of cycle time. Such an objective might be, for example, the minimization of
inter-processor communication, a factor relating directly to the minimization of bus contention.

The  Software  Allocation  Problem 

The trend toward the development and application of distributed signal processors
requires the efficient utilization of system resources. That is, the requirements of the
software must be matched with the capabilities of the distributed hardware. The better this
matching is achieved, the further the distributed hardware's potential advantages (e.g.,
high reliability, high throughput) are exploited.

In  this  paper,  we  consider  a  signal  processor  as  being  composed  of  individual  "processing 
elements,"  or  PEs,   each  of  which  is  essentially  an  autonomous  computer  consisting  of  four 
types  of  elements:     an  arithmetic  unit,   a  limited  capacity  memory,   one  or  more  I/O  ports, 
and  a  control  unit,   as  shown  in  Figure  2.     Inter-PE  communication  occurs  over  busses  to 
which  all  PEs  are  connected.     The  model  developed  is  for  a  fully  connected  signal  processor, 
that  is,   one  in  which  every  PE  can  communicate  with  every  other  PE.     Such  a  system  is 
illustrated in Figure 3. Unlike previous formulations [1], however, the model does not
require the PEs to be identical.


Figure 1a. Three tasks illustrating a linear precedence relationship.

Figure 1b.

Figure 2. Components of a Processing Element (PE).

The task allocation problem can be posed as follows: "Given (1) a system of distributed
processing elements; and (2) a set of software tasks to be executed on the system, what is
the 'best' allocation of tasks to processors?" The word "best," although difficult to
define precisely, is generally synonymous with the minimization of software "cycle time."
The allocation of tasks must usually be made under physical resource and/or temporal
constraints.


Figure 3. A typical (single bus) fully connected signal processor.

Although much work has gone into software allocation problems, most of it has been
performed in the context of generic distributed computer systems [2-9], not the relatively
specialized systems of distributed signal processors. Only recently have attempts been made
to apply task allocation methodologies to distributed signal processors. Glass and
Kotiveeriah have developed a heuristic algorithm tailored to the specific needs of such
systems [1]. The authors exploit the typically deterministic nature of the signal processing
environment to produce efficient task-to-PE allocations. It is the work of Glass and
Kotiveeriah which provided the stimulus for the adaptation of the present method of task
allocation to signal processors.

Our formulation of the allocation problem incorporates several realistic features of
signal processing software:

1.  inter-task  communication, 

2.  task  memory  requirements,  and 

3.  time  constraints  on  task  execution. 

Each  of  these  formulation  elements  is  discussed  below. 

One of the main problems with distributing tasks in a signal processor is that
communication requirements are increased over what they would be in a centralized system. For
example, if two tasks which must communicate (send and/or receive data) lie in different PEs,
then communication overhead is incurred. In addition to increasing the effective execution
time of the tasks, this overhead draws upon the bus resources, thereby increasing bus
contention.

Another  feature  of  the  task  allocation  problem  which  is  modelled  is  the  memory  required 
by  the  executable  code  of  the  tasks.     If  the  PE's  memories  are  of  unlimited,   or  effectively 
unlimited,   capacity  then,   of  course,   the  memory  requirements  of  the  tasks  do  not  affect 
possible  task-to-PE  allocations.     However,   more  realistically,   each  PE  possesses  a  finite 
amount of memory space, therefore limiting the possible allocations. Our formulation
permits the modelling of PEs of limited memory capacity.

Finally, the model presented in this paper permits temporal constraints to be placed on
task execution. Throughput of distributed signal processors is a direct function of the
tasks' cycle time. Therefore, by setting an upper limit on this parameter, only allocations
which satisfy the limit are found.


We formulate the allocation problem with an integer programming representation. Here,
the decision variable $X_{ik}$ is defined as

\[
X_{ik} =
\begin{cases}
1 & \text{if task } i \text{ is assigned to PE } k \\
0 & \text{otherwise}
\end{cases}
\]

Also, we define

$t_{ik}$ = execution time of task $i$ on PE $k$

$C_{ijkl}$ = communication time for tasks $i$ and $j$ when $i$ is assigned to PE $k$ and
$j$ is assigned to PE $l$

$m_i$ = memory requirement of task $i$

$M_k$ = capacity of the memory associated with PE $k$.

While  the  solution  method  permits  the  use  of  many  performance  criteria,   we  choose  to 
minimize  the  tasks'   cycle  time.     This  is  equivalent  to  minimizing  the  maximum  PE  finishing 
time.     The  problem  can  then  be  written  as 

\[
\text{Minimize } T = \max_{k} \sum_{i} \Big[ X_{ik} t_{ik} + \sum_{j} \sum_{l} C_{ijkl} X_{ik} X_{jl} \Big]
\]

subject to:

1. $\sum_{i} m_i X_{ik} \le M_k$ for all $k$ (memory constraint),

2. $\sum_{k} X_{ik} = 1$ (all tasks must be assigned once), and

3. $X_{ik} = 0$ or $1$ (binary allocation variables).

With this formulation, the allocation problem will be solved using spatial dynamic
programming.
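
To make the formulation concrete, the objective and constraints can be evaluated directly for a small instance. The Python sketch below is a brute-force enumeration, not spatial dynamic programming, and is practical only for a handful of tasks. Its names (`cycle_time`, `feasible`, `best_allocation`) are hypothetical, an allocation is represented as a tuple giving the PE of each task (which satisfies constraints 2 and 3 by construction), and the inner communication sum is assumed to run over all tasks j other than i.

```python
from itertools import product

def cycle_time(assign, t, comm):
    """T = max over PEs of the summed execution and communication times.

    assign[i] is the PE of task i, t[i][k] the execution time of task i on PE k,
    and comm(i, j, k, l) the communication time C_ijkl (assumed callable).
    """
    finish = [0.0] * len(t[0])
    for i, k in enumerate(assign):
        finish[k] += t[i][k]
        for j, l in enumerate(assign):
            if j != i:                     # assumption: no self-communication term
                finish[k] += comm(i, j, k, l)
    return max(finish)

def feasible(assign, m, M):
    """Memory constraint: the tasks placed on each PE must fit in its memory."""
    used = [0] * len(M)
    for i, k in enumerate(assign):
        used[k] += m[i]
    return all(u <= cap for u, cap in zip(used, M))

def best_allocation(t, comm, m, M):
    """Enumerate every task-to-PE assignment and keep the feasible one with least T."""
    best = None
    for assign in product(range(len(M)), repeat=len(t)):
        if feasible(assign, m, M):
            T = cycle_time(assign, t, comm)
            if best is None or T < best[0]:
                best = (T, assign)
    return best
```

The enumeration visits every one of the K^N assignments of N tasks to K PEs, which is the combinatorial growth that a structured method such as SDP is intended to avoid.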

Summary  of  the  Software  Allocation  Method 

In this section we present a technique for solving the problem of assigning software tasks
to the processing elements comprising a distributed signal processor. The method used,
spatial dynamic programming (SDP), is a method for solving complex, nonlinear, constrained
optimization problems which exhibit a form of sparsity. The theoretic basis of SDP is
presented in [10] while [11, 12] contain details regarding the specific application of the
method to the generic (i.e., non-signal processor) task allocation problem. In this section
we summarize the general technique while referring readers to the mentioned references,
particularly [10], for more detailed information.

The task allocation problem is formulated as a network of nodes and links. The formulation is intended to be figurative only and should not be confused with the signal processor's network of processing elements. In the SDP network each node has a set of variables associated with it. A link between two nodes indicates that the nodes share a common variable. Nodes are processed one at a time, as stages in a dynamic programming setting, and a traceback through the stages (again, as in dynamic programming) is performed to find the optimal solution.

The  SDP  methodology  developed  can  be  used  to  solve  a  very  general  formulation  of  the  task 
allocation  problem.     The  model  incorporates: 

1.  tasks  with  arbitrary  processing  and  memory  requirements, 

2.  processing  elements  with  arbitrary  throughputs  and  memory  capacities, 

3.  inter-task  communication, 

4.  many different performance criteria, including minimizing the software cycle time, and

5.  upper bounds on cycle time.


For the allocation of tasks to PE's, there are two types of nodes in the SDP network. One type represents PE's while the other represents tasks. A link connecting a PE node to a task node indicates that the task may be allocated to the PE. Associated with each (figurative) PE node are the values of throughput and memory capacity pertaining to the actual PE. In most signal processor design problems the resulting SDP network is sparse, with each task capable of being allocated to only a few PE's. It is this sparsity which is exploited by the SDP method.
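The effect of sparsity on the size of the search space can be illustrated with a short Python sketch. The link structure below is hypothetical and is chosen only to show the counting argument. The sketch does not implement SDP itself; it merely compares the number of candidate allocations with and without the restriction to permitted task-to-PE links.

```python
from math import prod

# Hypothetical link structure: task number -> PEs it may be allocated to.
# These links are illustrative only and are not taken from the paper.
links = {1: [1], 2: [1, 2], 3: [2], 4: [2, 3], 5: [3], 6: [3, 4]}
num_pes = 4

# Without link restrictions every task could go to every PE.
dense_count = num_pes ** len(links)

# With link restrictions each task chooses only among its linked PEs.
sparse_count = prod(len(pes) for pes in links.values())

print(dense_count, sparse_count)
```

In this illustrative case the restriction to permitted links reduces the candidate allocations from 4096 to 8, before any memory or timing constraint is applied.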

To  solve  the  problem,  we  require  information  on: 


1.  number  of  instructions  for  each  task, 

2.  memory  requirement  of  each  task, 

3.  throughput  of  each  PE's  arithmetic  unit,  and 

4.  capacity of each PE's memory.

The SDP methodology permits the time constraints imposed on the tasks to take different forms. For example, each task $i$ typically has to finish before some deadline $d_i$. For any set of tasks it is easy to determine whether all the upper bound constraints can be met. To do this the tasks should be ordered with respect to their deadlines. If, with this ordering, some deadlines are violated, then some will be violated with any ordering and not all of the tasks can meet their deadlines.
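The deadline-ordering test described above can be sketched in a few lines of Python. The sketch assumes a single PE, no preemption, and that all tasks are ready at time zero; the function name `deadlines_feasible` and the numerical values are hypothetical.

```python
def deadlines_feasible(tasks):
    """tasks: list of (execution_time, deadline) pairs for one PE.
    Run the tasks in order of increasing deadline and report whether
    every task finishes by its deadline."""
    clock = 0.0
    for exec_time, deadline in sorted(tasks, key=lambda task: task[1]):
        clock += exec_time
        if clock > deadline:
            return False
    return True

# Hypothetical task sets (times in ms).
ok = deadlines_feasible([(0.2, 0.5), (0.15, 0.2), (0.1, 1.0)])
bad = deadlines_feasible([(0.3, 0.4), (0.2, 0.4)])
print(ok, bad)
```

In the second set the two tasks together need 0.5 ms but both must finish by 0.4 ms, so no ordering can satisfy them.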

A different form of time constraint is particularly relevant since signal processing tasks have to execute periodically. Suppose task $i$, with execution time $e_i$, must execute once every $t_i$ time units, $i = 1, 2, \ldots, n$. While the actual scheduling of the tasks may be difficult, task $i$ requires a fraction $e_i/t_i$ of the PE's available time. If

$$\sum_{i=1}^{n} e_i / t_i > 1,$$

then not all tasks can execute as frequently as they should, and some task time constraints will be violated.
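As a minimal sketch of this test, the Python fragment below sums the fractions $e_i/t_i$ for the tasks on one PE and flags an overload when the sum exceeds one. The function names and the task values are hypothetical. Exact rational arithmetic is used so that a load of exactly one is not misclassified by rounding. The test only detects overload; a sum of at most one does not by itself guarantee that a valid schedule exists.

```python
from fractions import Fraction

def utilization(tasks):
    """tasks: list of (execution_time, period) pairs in the same integer
    time unit. Returns the exact fraction of PE time the tasks require."""
    return sum(Fraction(e, t) for e, t in tasks)

def overloaded(tasks):
    """True when the tasks demand more than all of the PE's available time."""
    return utilization(tasks) > 1

# Hypothetical periodic tasks, times in microseconds.
fits = [(200, 1000), (150, 500), (300, 600)]   # 1/5 + 3/10 + 1/2 = 1
too_many = fits + [(100, 1000)]                # adds 1/10

print(utilization(fits), overloaded(fits))
print(utilization(too_many), overloaded(too_many))
```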

Yet another form of time constraint is related to precedence constraints. Here, each task $i$ may have both an earliest time to begin execution $r_i$ and a deadline $d_i$. If all $r_i$'s are zero, then this constraint is the same as the upper bound constraint. However, suppose $r_1 < d_1 < r_2 < d_2$. This inequality would imply the simple precedence relation that task 1 must precede task 2. In general, the $[r_i, d_i]$ intervals will be overlapping, and with a large number of tasks the problem of determining whether all the time constraints can be met is difficult. McMahon and Florian [13], and Baker and Su [14] discuss methods of solving this problem.

Example 

In  this  section  we  present  an  example  to  demonstrate  how  the  SDP  methodology  can  be  used 
to  solve  a  simple  signal  processing  problem.     The  problem  consists  of  a  specified  set  of 
software  and  hardware,  with  associated  data.     The  assignment  of  software  to  hardware  will  be 
done  to  minimize  the  software  cycle  time. 

Specifically, the software consists of six communicating tasks, each of which can be considered as being a different phase of a complete signal processing operation. Each task operates on a given set of sensor signals received by the hardware. For example, task 1 might compress or reformat the incoming data, task 2 could pass the data through a preliminary filter, task 3 might initiate the central data analysis operations, etc.

Each task is strictly constrained to begin execution after its preceding task has completed execution. However, there may exist two types of delays which prevent the successor's immediate execution. First, the predecessor must communicate with its successor, which, if the two tasks are not assigned to the same processor, requires bus access and its inherent time delays. Secondly, there may exist "dead time" between the execution of two tasks. Typically, such time intervals result from execution constraints placed by the other tasks operating in the system (e.g., all the other processors may be fully loaded).

The data used in this example are detailed below:*

1)  Each PE has a throughput of 0.5 MIPS;

2)  Each PE has a capacity of 64K bytes;

3)  Each task requires 32K bytes of memory; and

4)  Each pair of communicating tasks requires 0.05 ms to communicate if they are not executed by the same PE.

*Identical PE's were used for simplicity. The SDP methodology accommodates non-identical PE's with no additional computational overhead.

Shown in Table 1 is the number of instructions each task executes.

TASK NUMBER     NUMBER OF INSTRUCTIONS
1               100
2               75
3               150
4               [illegible]
5               50
6               150

Table 1.  Task execution data

Shown in Figure 4 are the feasible task-to-PE allocations. The sparsity is induced primarily by the memory constraints, which require that not more than two tasks be assigned to a single PE.

Figure 4.  Task-to-processor assignment graph.

Finally we introduce two constraints, both of which are realistic in the current context. First, no preemption of tasks executing on a PE is allowed. Secondly, if a PE executes two tasks, then they must be consecutive ones. Clearly, the operation of the system would be degraded if this constraint were not imposed. Consider the following sequence of events which could then occur.


1)  PE1 executes task k;

2)  Task k communicates with task k+1 via the bus connecting PE1 with PE2;

3)  PE2 executes task k+1;

4)  Task k+1 communicates with task k+2 via the (same) bus; and

5)  PE1 executes task k+2.


It is straightforward to show that the above sequence degrades possible system performance regardless of where the other tasks in the system are executing. Finally, with the introduction of the latter constraint, the cycle time becomes exactly the same as the sum of the maximum processing time and the communication time for the processors.

The solution to this problem was found using the SDP method in less than 0.02 seconds on a VAX 11/780. The optimal solution is to allocate tasks 1 and 2 to processor 1, task 3 to processor 2, tasks 4 and 5 to processor 3, and task 6 to processor 4. The resulting cycle time is 0.4 ms. A time line showing what each processor is working on at all times is shown in Figure 5.
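The reported cycle time can be checked against the example data with a short Python sketch. It applies the rule stated in the text, that the cycle time equals the maximum processing time plus the communication time. Times are kept in whole microseconds: a throughput of 0.5 MIPS corresponds to 2 microseconds per instruction, and 0.05 ms is 50 microseconds. The instruction count for task 4 is not legible in the source table; the value 50 used below is an assumption, and the result is unchanged for any task 4 count up to 125, since processor 1 remains the most heavily loaded. The variable and function names are hypothetical.

```python
US_PER_INSTRUCTION = 2   # 1 / (0.5 MIPS) = 2 microseconds per instruction
COMM_US = 50             # 0.05 ms between tasks on different PEs

# Instruction counts per task. Task 4 is an assumed value (illegible in source).
instructions = {1: 100, 2: 75, 3: 150, 4: 50, 5: 50, 6: 150}

# Allocation reported as optimal: processor -> tasks.
allocation = {1: [1, 2], 2: [3], 3: [4, 5], 4: [6]}

def processing_us(tasks):
    """Total processing time of the tasks assigned to one processor."""
    return sum(instructions[task] for task in tasks) * US_PER_INSTRUCTION

per_pe_us = {pe: processing_us(tasks) for pe, tasks in allocation.items()}

# Cycle time = maximum processing time + communication time.
cycle_us = max(per_pe_us.values()) + COMM_US

print(per_pe_us)
print(cycle_us / 1000, "ms")
```

Processor 1 carries 175 instructions, or 350 microseconds of processing, and adding the 50 microsecond communication time gives the 0.4 ms cycle time quoted above.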


Figure 5.  Time line of task processing (time axis marked at 0.5 ms and 1.0 ms).

Conclusions


In  this  paper  we  have  summarized  an  efficient  method  of  allocating  software  tasks  to  the 
processing  elements  of  a  distributed  signal  processor.     The  method  was  based  upon  spatial 
dynamic  programming  and  was  used  to  determine  optimal  allocations   for  software  involving 
inter-task  communication,   task  memory  requirements,   and  time  constraints  on  cycle  times. 

The method developed was applied to a simplified distributed signal processor and found to hold potential for allocating tasks in such a system. A possible application of the method is to analyze candidate signal processor architectures. Here, the method can be used to determine competing operational aspects and assist in selection of an architecture which minimizes the cost-to-performance ratio. The potential for the method's future application to signal processor design is due to the sparsity of the software interactions, an attribute which is very effectively exploited by the method.

Currently, research is being done on expanding the model to encompass software with stochastic execution times and communication volumes.